Grounded Token Initialization: Fixing a Key Bottleneck in Extending Language Models with New Vocabulary
A new study shows that the standard practice of initializing new vocabulary tokens as the mean of existing embeddings creates a critical bottleneck when extending language models to domain-specific tasks.
The Problem
When extending LMs with new tokens (e.g., Semantic-ID tokens for recommendation), the standard approach:
- Initializes new tokens as the mean of existing vocabulary embeddings
- Relies on fine-tuning to learn representations
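The standard mean-initialization step can be sketched in a few lines. This is a minimal illustration with a made-up embedding table, not the paper's code; the point is that every new token receives the exact same vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained embedding table: 1000 tokens, 64 dimensions.
vocab = rng.normal(size=(1000, 64))

# Standard practice: every new token starts at the mean of existing embeddings.
mean_init = vocab.mean(axis=0)
num_new = 5
new_tokens = np.tile(mean_init, (num_new, 1))

# All new tokens are identical at initialization: no inter-token distinctions.
assert np.allclose(new_tokens[0], new_tokens[1])
extended = np.vstack([vocab, new_tokens])
print(extended.shape)
```

Fine-tuning then has to pull these identical vectors apart from a single shared starting point, which is exactly the bottleneck the paper diagnoses.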
Through spectral and geometric diagnostics, the researchers show that this approach collapses all new tokens into a degenerate subspace, erasing the inter-token distinctions that fine-tuning struggles to recover.
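One simple geometric diagnostic in this spirit (an illustration, not necessarily the paper's exact metric) is the average pairwise cosine similarity among new-token embeddings: values near 1 indicate the tokens have collapsed onto a single direction.

```python
import numpy as np

def mean_pairwise_cosine(E):
    """Average off-diagonal cosine similarity; values near 1 signal collapse."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = U @ U.T
    n = len(E)
    return (S.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
# Well-spread embeddings vs. mean-initialized tokens after a little training noise.
spread = rng.normal(size=(50, 64))
collapsed = np.tile(spread.mean(axis=0), (50, 1)) + 1e-3 * rng.normal(size=(50, 64))

print(f"{mean_pairwise_cosine(spread):.2f}")     # near 0: tokens point in distinct directions
print(f"{mean_pairwise_cosine(collapsed):.2f}")  # near 1: tokens share one direction
```

The tiny noise term stands in for early fine-tuning updates: even after perturbation, the mean-initialized tokens remain nearly parallel, illustrating why fine-tuning struggles to recover distinctions.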
The Solution: Grounded Token Initialization (GTI)
The paper proposes the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning enables the model to leverage its general-purpose knowledge.
GTI is a lightweight grounding stage that:
- Maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space
- Uses only paired linguistic supervision (no additional training data needed)
- Runs before fine-tuning
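A minimal sketch of this grounding idea, under the assumption that the "paired linguistic supervision" takes the form of a textual description per new token (the token names and descriptions below are hypothetical): each new Semantic-ID token is placed at the mean of its description's pretrained embeddings, so every token starts at a distinct, semantically meaningful location.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical pretrained word embeddings (stand-in for the model's vocabulary).
pretrained = {w: rng.normal(size=dim) for w in
              ["running", "shoes", "wireless", "headphones", "coffee", "maker"]}

# Assumed supervision format: each new Semantic-ID token is paired with a
# short textual description.
descriptions = {
    "<sid_17>": ["running", "shoes"],
    "<sid_42>": ["wireless", "headphones"],
    "<sid_93>": ["coffee", "maker"],
}

# Ground each new token at the mean of its description's pretrained embeddings.
grounded = {tok: np.mean([pretrained[w] for w in words], axis=0)
            for tok, words in descriptions.items()}

# Unlike mean-of-vocabulary initialization, grounded tokens start distinct.
assert not np.allclose(grounded["<sid_17>"], grounded["<sid_42>"])
```

This requires no extra training data beyond the token-description pairs and runs as a one-shot step before fine-tuning, which is what makes the grounding stage lightweight.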
Results
GTI outperforms both mean initialization and existing auxiliary-task adaptation methods across multiple generative recommendation benchmarks, including:
- Industry-scale datasets
- Public benchmark datasets
Analysis confirms grounded embeddings produce richer inter-token structure that persists through fine-tuning.
Why This Matters
- Generative recommendation: Better Semantic-ID token representations
- Domain adaptation: Any scenario requiring vocabulary extension
- Efficiency: Lightweight pre-finetuning step with outsized impact
- Theory: Provides diagnostic tools for understanding embedding quality
Paper: arXiv:2604.02324