Scaling Karpathy's Autoresearch: An AI Agent Ran 910 ML Experiments in 8 Hours Across 16 GPUs
From One GPU to Sixteen
Andrej Karpathy's autoresearch project demonstrated that an AI coding agent could autonomously improve a neural network training script — editing train.py, running experiments, and keeping changes that improved validation loss. In his first run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2.
The bottleneck? One GPU, one experiment at a time. About 12 experiments per hour.
SkyPilot asked: what happens when you remove the infrastructure bottleneck and let the agent manage its own compute?
The Experiment
The SkyPilot team pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster (a mix of H100s and H200s). The ground rules:
- prepare.py — Read-only, the agent cannot touch it
- train.py — The only file the agent modifies (GPT model, optimizer, training loop)
- program.md — Instructions for what to change and how to evaluate
- Each experiment: 5-minute training run on a GPU
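The core loop is simple: propose a change to train.py, run it, and keep the change only if validation loss improves. A minimal sketch of that keep-if-better loop, where `run_experiment` is a hypothetical stand-in for launching a 5-minute training run on a GPU and reading back val_bpb:

```python
# Hypothetical stand-in for "run train.py for 5 minutes and read val_bpb".
# In the real setup this launches a GPU training job; a deterministic stub
# lets the control loop be shown end to end. Lower val_bpb is better.
def run_experiment(params: dict) -> float:
    # Pretend lr=3e-4 and width=1024 are the ideal settings.
    return abs(params["lr"] - 3e-4) * 100 + abs(params["width"] - 1024) / 4096

def greedy_search(baseline: dict, candidates: list[dict]) -> tuple[dict, float]:
    """Try each candidate change; keep it only if val_bpb improves."""
    best_params, best_bpb = baseline, run_experiment(baseline)
    for change in candidates:
        trial = {**best_params, **change}   # apply change on top of current best
        bpb = run_experiment(trial)
        if bpb < best_bpb:                  # improvement: keep the change
            best_params, best_bpb = trial, bpb
    return best_params, best_bpb

baseline = {"lr": 1e-3, "width": 768}
candidates = [{"lr": 3e-4}, {"width": 1024}, {"lr": 1e-2}]
params, bpb = greedy_search(baseline, candidates)
```

With one GPU this loop is strictly sequential, which is exactly the bottleneck the 16-GPU setup removes.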
Results: 910 Experiments in 8 Hours
| Metric | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Experiments | ~910 | ~910 |
| Time | ~72 hours (extrapolated) | ~8 hours (actual) |
| Speedup | 1x baseline | 9x faster |
| Best val_bpb | 0.974 | 0.974 |
| Improvement | — | 2.87% over baseline |
Five Research Phases
The agent's work naturally organized into distinct phases:
Phase 1: Hyperparameter Sweeps (~200 experiments)
Systematic exploration of learning rates, batch sizes, and other standard parameters.
Phase 2: Architecture Discovery (experiments ~200-420)
Key finding: model width mattered more than any single hyperparameter. The agent tested six model widths in a single parallel wave, saw the trend immediately, and zeroed in on the best — one round instead of six sequential rounds.
Phase 3: Fine-tuning the Wider Model (experiments ~420-560)
Focused optimization of the best architecture found in Phase 2.
Phase 4: Optimizer Tuning (experiments ~560-700)
Exploring different optimizer configurations for the chosen architecture.
Phase 5: Diminishing Returns (experiments ~700-910)
The agent continued searching but found fewer improvements — a natural signal to stop.
Emergent Behavior: Heterogeneous Hardware Exploitation
Perhaps the most fascinating finding was an emergent strategy the agent developed on its own:
"The agent discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference: screen ideas on cheap H100s, promote winners to H200s for validation."
No human told it to do this. It observed the available hardware, inferred the cost-performance tradeoffs, and developed an efficient tiered evaluation strategy — the same approach human ML researchers use when managing GPU budgets.
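A minimal sketch of that tiered strategy, assuming two stand-in scoring functions (in the real setup each would launch a training job on the corresponding GPU type; names and numbers here are illustrative):

```python
# Screen-then-promote: rank everything with cheap, short runs, then spend
# premium GPU time re-validating only the top candidates.
def screen_on_h100(candidate: dict) -> float:
    # Cheap, noisier proxy run; lower val_bpb is better.
    return candidate["base_bpb"] + 0.01   # short runs read slightly worse

def validate_on_h200(candidate: dict) -> float:
    # Longer, more faithful run on the faster hardware.
    return candidate["base_bpb"]

def tiered_eval(candidates: list[dict], promote_k: int = 2) -> dict:
    # Stage 1: rank everything on the cheap tier.
    screened = sorted(candidates, key=screen_on_h100)
    # Stage 2: validate only the top-k finalists on the scarce tier.
    finalists = screened[:promote_k]
    return min(finalists, key=validate_on_h200)

pool = [{"name": "a", "base_bpb": 0.990},
        {"name": "b", "base_bpb": 0.974},
        {"name": "c", "base_bpb": 1.020}]
best = tiered_eval(pool)
```

The design point is that expensive-tier calls scale with `promote_k`, not with the size of the candidate pool.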
How Parallelism Changed the Agent's Strategy
With one GPU, the agent was stuck doing greedy hill-climbing: try one thing, check, repeat. With 16 GPUs, its strategy fundamentally changed:
- Factorial grids of 10-13 experiments per wave
- Interaction effects between parameters caught in parallel
- Wider exploration — testing six options simultaneously instead of sequentially
- Faster convergence — reaching the same result in 1/9th the time
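The wave pattern above can be sketched as a factorial grid dispatched to a worker pool, one slot per GPU. The grid values and the scoring function are illustrative stand-ins, not the agent's actual search space:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Mock val_bpb: a stand-in for a real 5-minute training run.
def run_experiment(lr: float, width: int) -> float:
    return abs(lr - 3e-4) * 100 + abs(width - 1024) / 4096

# Factorial grid: every (lr, width) combination -> 3 x 4 = 12 experiments,
# roughly the 10-13-experiment wave size reported above.
grid = list(product([1e-4, 3e-4, 1e-3], [512, 768, 1024, 1536]))

# One wave: all 12 experiments in flight at once, one worker per GPU slot.
with ThreadPoolExecutor(max_workers=16) as pool:
    scores = list(pool.map(lambda p: run_experiment(*p), grid))

best = grid[scores.index(min(scores))]
```

Because the whole grid lands in one wave, interaction effects between `lr` and `width` show up immediately instead of across many sequential rounds.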
Cost Analysis
The total compute cost for 910 × 5-minute experiments across 16 GPUs:
- ~76 GPU-hours of training compute (910 runs × 5 minutes each)
- Estimated cost: ~$15,000-$25,000 (cloud GPU pricing)
- Human labor: $0 — the agent ran autonomously
Compare this to a human ML researcher running the same experiments: ~72 hours of wall time plus the cognitive overhead of managing 910 experiments.
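A quick back-of-envelope check of these figures, using only the numbers reported above (910 five-minute runs work out to roughly 76 GPU-hours of pure training time, with the 16-GPU cluster occupied for 128 GPU-hours of wall clock):

```python
# Values taken from the article.
experiments = 910
minutes_each = 5
gpus = 16
wall_hours = 8

train_gpu_hours = experiments * minutes_each / 60   # pure training time
occupied_gpu_hours = gpus * wall_hours              # cluster occupancy
sequential_hours = experiments * minutes_each / 60  # one GPU, back to back
speedup = sequential_hours / wall_hours             # parallel vs sequential
```

The computed sequential time (~76 hours) is close to the article's ~72-hour figure, and the speedup comes out at roughly 9x, matching the results table.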
What This Means
For ML Research
- AI agents can now autonomously conduct real ML research at scale
- The bottleneck has shifted from human experimentation speed to agent context and reasoning quality
- Parallelism doesn't just make research faster — it changes how research is done
For the AI Industry
- Claude Code is proving itself as a serious research tool, not just a code assistant
- The 9x speedup from 16 GPUs suggests that AI research itself may be the next domain to benefit from massive parallelism
- Karpathy's autoresearch is becoming a benchmark for agent capabilities
For Infrastructure
- SkyPilot's thesis — that managing multi-cloud GPU infrastructure should be invisible — gets a powerful validation
- The agent managed its own compute without human infrastructure intervention
- Heterogeneous hardware awareness emerged naturally
The Road Ahead
This is still GPT-2-scale research. The real test will be:
- Can agents do this at GPT-3/GPT-4 scale?
- Can they discover novel architectures, not just optimize existing ones?
- Can they write research papers about their findings?
- What happens when you give them 160 GPUs? 1,600?
The answer to the last question may tell us more about the future of AI research than any paper published this year.
Source: SkyPilot Blog