Width Growth in Language Models: Exact Copy Warm Starts Surprisingly Beat Complex Initialization Strategies
A new study on dense language model width growth challenges assumptions about how to properly initialize wider models from smaller checkpoints. The counterintuitive finding: simple exact-copy symmetric warm starts often outperform more sophisticated initialization strategies.
The Problem
When you want to scale up a language model by adding more parameters (making it wider), how should you initialize the new parameters?
Options include:
- Exact copy — Copy existing weights to new positions
- Perturbative — Copy with small random perturbations
- Asymmetric reset — Reset some weights, keep others
- Structured non-clone — Use different initialization schemes
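To make the first two options concrete, here is a minimal NumPy sketch of widening a single weight matrix along its output dimension. The helper names and the cycling-clone scheme are illustrative assumptions, not the paper's exact procedure (which may, for example, also rescale duplicated units to preserve the network's function).

```python
import numpy as np

def widen_exact_copy(W, new_width):
    """Exact copy: clone existing rows (output units) into the new
    positions, cycling through the original units.

    Hypothetical helper for illustration only.
    """
    d_out, _ = W.shape
    assert new_width >= d_out
    src = np.arange(new_width - d_out) % d_out  # which units to clone
    return np.vstack([W, W[src]])

def widen_perturbative(W, new_width, sigma=0.01, rng=None):
    """Perturbative: same clones, plus small Gaussian noise on the
    new rows to break symmetry. sigma is an assumed hyperparameter."""
    rng = np.random.default_rng() if rng is None else rng
    W_wide = widen_exact_copy(W, new_width)
    W_wide[W.shape[0]:] += sigma * rng.standard_normal(
        W_wide[W.shape[0]:].shape
    )
    return W_wide

# Example: grow a 4-unit layer to 6 units.
W = np.arange(12, dtype=float).reshape(4, 3)
W_wide = widen_exact_copy(W, 6)
assert W_wide.shape == (6, 3)
assert np.array_equal(W_wide[4], W[0])  # new unit 4 clones unit 0
assert np.array_equal(W_wide[5], W[1])  # new unit 5 clones unit 1
```

Note that exact copy leaves duplicated units perfectly symmetric, so gradient updates keep them identical unless noise (data ordering, dropout, perturbation) breaks the tie; this is the "inherited cloned subspace" the study's key insight refers to.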
The Surprising Result
After comprehensive testing on a TinyStories proxy task:
- Exact-copy wins most metrics — It ranks first in every completed 16-step probe and in the stochastic 128-step continuation
- But not always — Structured non-clone wins deterministic long continuation (128 steps)
- The picture is mixed — No single strategy dominates across all scenarios
The Key Insight
"Early escape from the inherited cloned subspace is not a universal selector." Breaking away from the original weight structure helps in some scenarios (long deterministic training) but hurts in others (short probes, stochastic training).
Practical Takeaways
- Default to exact copy — It is simple, fast, and wins most benchmarks
- Consider alternatives — For specific long-horizon deterministic training, structured initialization may help
- Width growth is feasible — Reusing smaller model checkpoints is a practical scaling strategy
- Regime sensitivity matters — The right initialization depends on your specific training setup