Scaling Karpathy's Autoresearch: An AI Agent Ran 910 ML Experiments in 8 Hours Across 16 GPUs

2026-03-19 · 4 min read
SkyPilot gave Claude Code access to a 16-GPU Kubernetes cluster running Karpathy's autoresearch framework. The agent autonomously ran ~910 experiments in 8 hours, discovered that model width was the most important hyperparameter, learned to exploit heterogeneous hardware (screen on H100s, validate on H200s), and achieved a 2.87% improvement over baseline.

From One GPU to Sixteen

Andrej Karpathy's autoresearch project demonstrated that an AI coding agent could autonomously improve a neural network training script — editing train.py, running experiments, and keeping changes that improved validation loss. In his first run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2.

The bottleneck? One GPU, one experiment at a time. About 12 experiments per hour.

SkyPilot asked: what happens when you remove the infrastructure bottleneck and let the agent manage its own compute?

The Experiment

They pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster (a mix of H100s and H200s), then let the agent plan, launch, and evaluate experiments on its own.

Results: 910 Experiments in 8 Hours

| Metric | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Experiments | ~910 | ~910 |
| Time | ~72 hours (simulated) | ~8 hours (actual) |
| Speedup | 1x baseline | 9x faster |
| Best val_bpb | 0.974 | 0.974 |
| Improvement over baseline | 2.87% | 2.87% |

Five Research Phases

The agent's work naturally organized into distinct phases:

Phase 1: Hyperparameter Sweeps (~200 experiments)

Systematic exploration of learning rates, batch sizes, and other standard parameters.

Phase 2: Architecture Discovery (200-420)

Key finding: model width mattered more than any single hyperparameter. The agent tested six model widths in a single parallel wave, saw the trend immediately, and zeroed in on the best — one round instead of six sequential rounds.
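A wave-based sweep like this can be sketched in a few lines. Everything below is illustrative: `run_experiment` is a hypothetical stand-in for submitting a training job to the cluster and reading back its validation bits-per-byte, and the toy scoring curve is invented for the demo.

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(width: int) -> float:
    """Hypothetical stand-in for one training run on a free GPU.

    A real implementation would launch a cluster job and poll for the
    validation metric; here a toy curve with a minimum near width 768
    keeps the example self-contained.
    """
    return 0.98 + 1e-7 * (width - 768) ** 2

# One parallel wave: six candidate widths evaluated simultaneously,
# one per GPU, instead of six sequential rounds.
widths = [256, 384, 512, 640, 768, 1024]
with ThreadPoolExecutor(max_workers=len(widths)) as pool:
    scores = dict(zip(widths, pool.map(run_experiment, widths)))

best_width = min(scores, key=scores.get)  # lower val_bpb is better
```

The payoff is in the structure, not the threads: one wave yields the whole width-vs-loss trend at once, so the agent can commit to a winner after a single round.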

Phase 3: Fine-tuning the Wider Model (420-560)

Focused optimization of the best architecture found in Phase 2.

Phase 4: Optimizer Tuning (560-700)

Exploring different optimizer configurations for the chosen architecture.

Phase 5: Diminishing Returns (700-910)

The agent continued searching but found fewer improvements — a natural signal to stop.
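One simple stopping rule consistent with that behavior can be written as follows. This is an assumption for illustration, not the agent's actual criterion: stop when the best score has not improved meaningfully over a recent window of experiments.

```python
def should_stop(history: list[float], patience: int = 100,
                min_delta: float = 0.001) -> bool:
    """Diminishing-returns check: stop when the best val_bpb seen in the
    last `patience` experiments improves on the earlier best by less
    than `min_delta`. Lower is better."""
    if len(history) <= patience:
        return False
    best_recent = min(history[-patience:])
    best_before = min(history[:-patience])
    return best_before - best_recent < min_delta
```

Phase 5 in the run above looks exactly like this window in action: experiments 700 through 910 kept landing near the 0.974 floor, which is the signal such a rule would fire on.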

Emergent Behavior: Heterogeneous Hardware Exploitation

Perhaps the most fascinating finding was an emergent strategy the agent developed on its own:

"The agent discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference: screen ideas on cheap H100s, promote winners to H200s for validation."

No human told it to do this. It observed the available hardware, inferred the cost-performance tradeoffs, and developed an efficient tiered evaluation strategy — the same approach human ML researchers use when managing GPU budgets.
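The screen-then-promote pattern can be sketched as a two-tier filter. The function names and the idea list below are hypothetical stand-ins: in the real run, "screening" meant a short training run on an H100 and "validation" meant a full-length run on an H200.

```python
def screen_on_h100(idea: dict) -> float:
    """Cheap, short run: a noisy estimate of validation loss."""
    return idea["expected_bpb"] + 0.01  # toy: short runs read slightly worse

def validate_on_h200(idea: dict) -> float:
    """Expensive, full-length run: the trusted measurement."""
    return idea["expected_bpb"]

ideas = [
    {"name": "wider-model",  "expected_bpb": 0.974},
    {"name": "higher-lr",    "expected_bpb": 0.990},
    {"name": "bigger-batch", "expected_bpb": 1.002},
]

# Tier 1: screen every idea on the cheaper H100 pool.
screened = sorted(ideas, key=screen_on_h100)

# Tier 2: promote only the top candidates to H200s for validation.
finalists = screened[:2]
results = {idea["name"]: validate_on_h200(idea) for idea in finalists}
```

The design choice mirrors human practice: spend the scarce, fast hardware only on ideas that have already survived a cheap filter.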

How Parallelism Changed the Agent's Strategy

With one GPU, the agent was stuck doing greedy hill-climbing: try one thing, check, repeat. With 16 GPUs, its strategy fundamentally changed: it could launch an entire sweep as a single parallel wave, compare the results side by side, and commit to a winner in one round.

Cost Analysis

The total compute for 910 five-minute experiments comes to roughly 76 GPU-hours (910 × 5 minutes), spread across 16 GPUs.

Compare this to a human ML researcher running the same experiments: ~72 hours of wall time plus the cognitive overhead of managing 910 experiments.
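The arithmetic behind those numbers is simple enough to check directly:

```python
# Back-of-the-envelope compute for the run described above.
experiments = 910
minutes_each = 5
gpus = 16

gpu_hours = experiments * minutes_each / 60   # ~75.8 GPU-hours total
ideal_wall_hours = gpu_hours / gpus           # ~4.7 hours at perfect packing
sequential_hours = gpu_hours                  # ~76 hours on a single GPU
```

The ideal wall time at perfect packing is under 5 hours; the actual ~8 hours reflects scheduling gaps and the agent pausing to analyze results between waves, which is still a 9x improvement over running everything sequentially.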

What This Means

For ML Research

For the AI Industry

For Infrastructure

The Road Ahead

This is still GPT-2-scale research. The real test will be:

  1. Can agents do this at GPT-3/GPT-4 scale?
  2. Can they discover novel architectures, not just optimize existing ones?
  3. Can they write research papers about their findings?
  4. What happens when you give them 160 GPUs? 1,600?

The answer to the last question may tell us more about the future of AI research than any paper published this year.

Source: SkyPilot Blog
