Width Growth in Language Models: Exact Copy Warm Starts Surprisingly Beat Complex Initialization Strategies
A new study on dense language model width growth challenges assumptions about how to properly initialize wider models from smaller checkpoints. The counterintuitive finding: simple exact-copy symmetric warm starts often outperform more sophisticated initialization strategies.
The Problem
When you want to scale up a language model by adding more parameters (making it wider), how should you initialize the new parameters?
Options include:
- Exact copy — Copy existing weights to new positions
- Perturbative — Copy with small random perturbations
- Asymmetric reset — Reset some weights, keep others
- Structured non-clone — Use different initialization schemes
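To make the first two options concrete, here is a minimal NumPy sketch of widening a single weight matrix along its output dimension. The helper names and the cycling-clone scheme are illustrative assumptions, not the paper's exact procedure (which may, for example, also rescale duplicated units to preserve the network's function).

```python
import numpy as np

def widen_exact_copy(W, new_width):
    """Exact copy: clone existing rows (output units) into the new
    positions, cycling through the original units.

    Hypothetical helper for illustration only.
    """
    d_out, _ = W.shape
    assert new_width >= d_out
    src = np.arange(new_width - d_out) % d_out  # which units to clone
    return np.vstack([W, W[src]])

def widen_perturbative(W, new_width, sigma=0.01, rng=None):
    """Perturbative: same clones, plus small Gaussian noise on the
    new rows to break symmetry. sigma is an assumed hyperparameter."""
    rng = np.random.default_rng() if rng is None else rng
    W_wide = widen_exact_copy(W, new_width)
    W_wide[W.shape[0]:] += sigma * rng.standard_normal(
        W_wide[W.shape[0]:].shape
    )
    return W_wide

# Example: grow a 4-unit layer to 6 units.
W = np.arange(12, dtype=float).reshape(4, 3)
W_wide = widen_exact_copy(W, 6)
assert W_wide.shape == (6, 3)
assert np.array_equal(W_wide[4], W[0])  # new unit 4 clones unit 0
assert np.array_equal(W_wide[5], W[1])  # new unit 5 clones unit 1
```

Note that exact copy leaves duplicated units perfectly symmetric, so gradient updates keep them identical unless noise (data ordering, dropout, perturbation) breaks the tie; this is the "inherited cloned subspace" the study's key insight refers to.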
The Surprising Result
After comprehensive testing on a TinyStories proxy task:
- Exact-copy wins most metrics — It ranks first in every completed 16-step probe and in the stochastic 128-step continuation
- But not always — Structured non-clone wins deterministic long continuation (128 steps)
- The picture is mixed — No single strategy dominates across all scenarios
The Key Insight
"Early escape from the inherited cloned subspace is not a universal selector." Breaking away from the original weight structure helps in some scenarios (long deterministic training) but hurts in others (short probes, stochastic training).
Practical Takeaways
- Default to exact copy — It is simple, fast, and wins most benchmarks
- Consider alternatives — For specific long-horizon deterministic training, structured initialization may help
- Width growth is feasible — Reusing smaller model checkpoints is a practical scaling strategy
- Regime sensitivity matters — The right initialization depends on your specific training setup