Scaling Karpathy's Autoresearch: An AI Agent Ran 910 ML Experiments in 8 Hours Across 16 GPUs
From One GPU to Sixteen
Andrej Karpathy's autoresearch project demonstrated that an AI coding agent could autonomously improve a neural network training script — editing train.py, running experiments, and keeping changes that improved validation loss. In his first run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2.
The bottleneck? One GPU, one experiment at a time. About 12 experiments per hour.
SkyPilot asked: what happens when you remove the infrastructure bottleneck and let the agent manage its own compute?
The Experiment
The SkyPilot team pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster (a mix of H100s and H200s). The ground rules:
- prepare.py — Read-only, the agent cannot touch it
- train.py — The only file the agent modifies (GPT model, optimizer, training loop)
- program.md — Instructions for what to change and how to evaluate
- Each experiment: 5-minute training run on a GPU
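The core loop is simple: propose a change to train.py, run it, and keep the change only if validation loss improves. A minimal sketch of that keep-if-better loop, where `run_experiment` is a hypothetical stand-in for launching a 5-minute training run on a GPU and reading back val_bpb:

```python
# Hypothetical stand-in for "run train.py for 5 minutes and read val_bpb".
# In the real setup this launches a GPU training job; a deterministic stub
# lets the control loop be shown end to end. Lower val_bpb is better.
def run_experiment(params: dict) -> float:
    # Pretend lr=3e-4 and width=1024 are the ideal settings.
    return abs(params["lr"] - 3e-4) * 100 + abs(params["width"] - 1024) / 4096

def greedy_search(baseline: dict, candidates: list[dict]) -> tuple[dict, float]:
    """Try each candidate change; keep it only if val_bpb improves."""
    best_params, best_bpb = baseline, run_experiment(baseline)
    for change in candidates:
        trial = {**best_params, **change}   # apply change on top of current best
        bpb = run_experiment(trial)
        if bpb < best_bpb:                  # improvement: keep the change
            best_params, best_bpb = trial, bpb
    return best_params, best_bpb

baseline = {"lr": 1e-3, "width": 768}
candidates = [{"lr": 3e-4}, {"width": 1024}, {"lr": 1e-2}]
params, bpb = greedy_search(baseline, candidates)
```

With one GPU this loop is strictly sequential, which is exactly the bottleneck the 16-GPU setup removes.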
Results: 910 Experiments in 8 Hours
| Metric | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Experiments | ~910 | ~910 |
| Time | ~72 hours (extrapolated) | ~8 hours (actual) |
| Speedup | 1x baseline | 9x faster |
| Best val_bpb | 0.974 | 0.974 |
| Improvement | — | 2.87% over baseline |
Five Research Phases
The agent's work naturally organized into distinct phases:
Phase 1: Hyperparameter Sweeps (~200 experiments)
Systematic exploration of learning rates, batch sizes, and other standard parameters.
Phase 2: Architecture Discovery (experiments ~200-420)
Key finding: model width mattered more than any single hyperparameter. The agent tested six model widths in a single parallel wave, saw the trend immediately, and zeroed in on the best — one round instead of six sequential rounds.
Phase 3: Fine-tuning the Wider Model (experiments ~420-560)
Focused optimization of the best architecture found in Phase 2.
Phase 4: Optimizer Tuning (experiments ~560-700)
Exploring different optimizer configurations for the chosen architecture.
Phase 5: Diminishing Returns (experiments ~700-910)
The agent continued searching but found fewer improvements — a natural signal to stop.
Emergent Behavior: Heterogeneous Hardware Exploitation
Perhaps the most fascinating finding was an emergent strategy the agent developed on its own:
"The agent discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference: screen ideas on cheap H100s, promote winners to H200s for validation."
No human told it to do this. It observed the available hardware, inferred the cost-performance tradeoffs, and developed an efficient tiered evaluation strategy — the same approach human ML researchers use when managing GPU budgets.
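A minimal sketch of that tiered strategy, assuming two stand-in scoring functions (in the real setup each would launch a training job on the corresponding GPU type; names and numbers here are illustrative):

```python
# Screen-then-promote: rank everything with cheap, short runs, then spend
# premium GPU time re-validating only the top candidates.
def screen_on_h100(candidate: dict) -> float:
    # Cheap, noisier proxy run; lower val_bpb is better.
    return candidate["base_bpb"] + 0.01   # short runs read slightly worse

def validate_on_h200(candidate: dict) -> float:
    # Longer, more faithful run on the faster hardware.
    return candidate["base_bpb"]

def tiered_eval(candidates: list[dict], promote_k: int = 2) -> dict:
    # Stage 1: rank everything on the cheap tier.
    screened = sorted(candidates, key=screen_on_h100)
    # Stage 2: validate only the top-k finalists on the scarce tier.
    finalists = screened[:promote_k]
    return min(finalists, key=validate_on_h200)

pool = [{"name": "a", "base_bpb": 0.990},
        {"name": "b", "base_bpb": 0.974},
        {"name": "c", "base_bpb": 1.020}]
best = tiered_eval(pool)
```

The design point is that expensive-tier calls scale with `promote_k`, not with the size of the candidate pool.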
How Parallelism Changed the Agent's Strategy
With one GPU, the agent was stuck doing greedy hill-climbing: try one thing, check, repeat. With 16 GPUs, its strategy fundamentally changed:
- Factorial grids of 10-13 experiments per wave
- Interaction effects between parameters caught in parallel
- Wider exploration — testing six options simultaneously instead of sequentially
- Faster convergence — reaching the same result in 1/9th the time
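The wave pattern above can be sketched as a factorial grid dispatched to a worker pool, one slot per GPU. The grid values and the scoring function are illustrative stand-ins, not the agent's actual search space:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Mock val_bpb: a stand-in for a real 5-minute training run.
def run_experiment(lr: float, width: int) -> float:
    return abs(lr - 3e-4) * 100 + abs(width - 1024) / 4096

# Factorial grid: every (lr, width) combination -> 3 x 4 = 12 experiments,
# roughly the 10-13-experiment wave size reported above.
grid = list(product([1e-4, 3e-4, 1e-3], [512, 768, 1024, 1536]))

# One wave: all 12 experiments in flight at once, one worker per GPU slot.
with ThreadPoolExecutor(max_workers=16) as pool:
    scores = list(pool.map(lambda p: run_experiment(*p), grid))

best = grid[scores.index(min(scores))]
```

Because the whole grid lands in one wave, interaction effects between `lr` and `width` show up immediately instead of across many sequential rounds.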
Cost Analysis
The total compute cost for 910 × 5-minute experiments across 16 GPUs:
- ~76 GPU-hours of training compute (910 runs × 5 minutes each)
- Estimated cost: ~$15,000-$25,000 (cloud GPU pricing)
- Human labor: $0 — the agent ran autonomously
Compare this to a human ML researcher running the same experiments: ~72 hours of wall time plus the cognitive overhead of managing 910 experiments.
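A quick back-of-envelope check of these figures, using only the numbers reported above (910 five-minute runs work out to roughly 76 GPU-hours of pure training time, with the 16-GPU cluster occupied for 128 GPU-hours of wall clock):

```python
# Values taken from the article.
experiments = 910
minutes_each = 5
gpus = 16
wall_hours = 8

train_gpu_hours = experiments * minutes_each / 60   # pure training time
occupied_gpu_hours = gpus * wall_hours              # cluster occupancy
sequential_hours = experiments * minutes_each / 60  # one GPU, back to back
speedup = sequential_hours / wall_hours             # parallel vs sequential
```

The computed sequential time (~76 hours) is close to the article's ~72-hour figure, and the speedup comes out at roughly 9x, matching the results table.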
What This Means
For ML Research
- AI agents can now autonomously conduct real ML research at scale
- The bottleneck has shifted from human experimentation speed to agent context and reasoning quality
- Parallelism doesn't just make research faster — it changes how research is done
For the AI Industry
- Claude Code is proving itself as a serious research tool, not just a code assistant
- The 9x speedup from 16 GPUs suggests that AI research itself may be the next domain to benefit from massive parallelism
- Karpathy's autoresearch is becoming a benchmark for agent capabilities
For Infrastructure
- SkyPilot's thesis — that managing multi-cloud GPU infrastructure should be invisible — gets a powerful validation
- The agent managed its own compute without human infrastructure intervention
- Heterogeneous hardware awareness emerged naturally
The Road Ahead
This is still GPT-2-scale research. The real test will be:
- Can agents do this at GPT-3/GPT-4 scale?
- Can they discover novel architectures, not just optimize existing ones?
- Can they write research papers about their findings?
- What happens when you give them 160 GPUs? 1,600?
The answer to the last question may tell us more about the future of AI research than any paper published this year.
Source: SkyPilot Blog