Width Growth in Language Models: Exact Copy Warm Starts Surprisingly Beat Complex Initialization Strategies

2026-04-07 · 1 min read
A new study on dense language model width growth challenges assumptions about how to properly initialize wider models from smaller checkpoints. The counterintuitive finding: simple exact-copy symmetric warm starts often outperform more sophisticated initialization strategies.

The Problem

When you want to scale up a language model by adding more parameters (making it wider), how should you initialize the new parameters?

Options include:

  1. Exact copy — Copy existing weights to new positions
  2. Perturbative — Copy with small random perturbations
  3. Asymmetric reset — Reset some weights, keep others
  4. Structured non-clone — Use different initialization schemes
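To make the first two options concrete, here is a minimal sketch of widening a single linear layer's weight matrix. The helper names and the noise scale are illustrative assumptions, not the study's actual growth operator:

```python
import numpy as np

def widen_exact_copy(W: np.ndarray, factor: int = 2) -> np.ndarray:
    """Exact-copy symmetric warm start: tile the existing (d_out, d_in)
    weights into the new positions, so every new unit starts as a clone."""
    return np.tile(W, (factor, factor))

def widen_perturbative(W: np.ndarray, factor: int = 2,
                       eps: float = 1e-3, seed: int = 0) -> np.ndarray:
    """Same copy, plus small random noise to break the clone symmetry."""
    rng = np.random.default_rng(seed)
    W_big = np.tile(W, (factor, factor))
    return W_big + eps * rng.standard_normal(W_big.shape)

# Example: grow a 2x3 layer to 4x6.
W = np.arange(6, dtype=float).reshape(2, 3)
W2 = widen_exact_copy(W)
assert W2.shape == (4, 6)
assert np.allclose(W2[:2, :3], W)  # original weights preserved in place
```

In the exact-copy case the cloned units receive identical gradients and stay in the inherited subspace until noise (dropout, data ordering) separates them; the perturbative variant forces that separation from step one.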

The Surprising Result

After comprehensive testing on a TinyStories proxy setup, simple exact-copy symmetric warm starts often matched or beat the more sophisticated initialization strategies.

The Key Insight

"Early escape from the inherited cloned subspace is not a universal selector." Breaking away from the original weight structure helps in some scenarios (long deterministic training) but hurts in others (short probes, stochastic training).

Practical Takeaways

↗ Original source · 2026-04-07