Grounded Token Initialization: Fixing a Key Bottleneck in Extending Language Models with New Vocabulary
A new study shows that the standard practice of initializing new vocabulary tokens as the mean of existing embeddings creates a critical bottleneck when extending language models to domain-specific tasks.
The Problem
When extending LMs with new tokens (e.g., Semantic-ID tokens for recommendation), the standard approach:
- Initializes new tokens as the mean of existing vocabulary embeddings
- Relies on fine-tuning to learn representations
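The standard mean-initialization step can be sketched in a few lines. This is a minimal illustration with a made-up embedding table, not the paper's code; the point is that every new token receives the exact same vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained embedding table: 1000 tokens, 64 dimensions.
vocab = rng.normal(size=(1000, 64))

# Standard practice: every new token starts at the mean of existing embeddings.
mean_init = vocab.mean(axis=0)
num_new = 5
new_tokens = np.tile(mean_init, (num_new, 1))

# All new tokens are identical at initialization: no inter-token distinctions.
assert np.allclose(new_tokens[0], new_tokens[1])
extended = np.vstack([vocab, new_tokens])
print(extended.shape)
```

Fine-tuning then has to pull these identical vectors apart from a single shared starting point, which is exactly the bottleneck the paper diagnoses.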
Through spectral and geometric diagnostics, the researchers show that this approach collapses all new tokens into a degenerate subspace, erasing the inter-token distinctions that fine-tuning struggles to recover.
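One simple geometric diagnostic in this spirit (an illustration, not necessarily the paper's exact metric) is the average pairwise cosine similarity among new-token embeddings: values near 1 indicate the tokens have collapsed onto a single direction.

```python
import numpy as np

def mean_pairwise_cosine(E):
    """Average off-diagonal cosine similarity; values near 1 signal collapse."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = U @ U.T
    n = len(E)
    return (S.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
# Well-spread embeddings vs. mean-initialized tokens after a little training noise.
spread = rng.normal(size=(50, 64))
collapsed = np.tile(spread.mean(axis=0), (50, 1)) + 1e-3 * rng.normal(size=(50, 64))

print(f"{mean_pairwise_cosine(spread):.2f}")     # near 0: tokens point in distinct directions
print(f"{mean_pairwise_cosine(collapsed):.2f}")  # near 1: tokens share one direction
```

The tiny noise term stands in for early fine-tuning updates: even after perturbation, the mean-initialized tokens remain nearly parallel, illustrating why fine-tuning struggles to recover distinctions.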
The Solution: Grounded Token Initialization (GTI)
The paper proposes the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning enables the model to leverage its general-purpose knowledge.
GTI is a lightweight grounding stage that:
- Maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space
- Uses only paired linguistic supervision (no additional training data needed)
- Runs before fine-tuning
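A minimal sketch of this grounding idea, under the assumption that the "paired linguistic supervision" takes the form of a textual description per new token (the token names and descriptions below are hypothetical): each new Semantic-ID token is placed at the mean of its description's pretrained embeddings, so every token starts at a distinct, semantically meaningful location.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical pretrained word embeddings (stand-in for the model's vocabulary).
pretrained = {w: rng.normal(size=dim) for w in
              ["running", "shoes", "wireless", "headphones", "coffee", "maker"]}

# Assumed supervision format: each new Semantic-ID token is paired with a
# short textual description.
descriptions = {
    "<sid_17>": ["running", "shoes"],
    "<sid_42>": ["wireless", "headphones"],
    "<sid_93>": ["coffee", "maker"],
}

# Ground each new token at the mean of its description's pretrained embeddings.
grounded = {tok: np.mean([pretrained[w] for w in words], axis=0)
            for tok, words in descriptions.items()}

# Unlike mean-of-vocabulary initialization, grounded tokens start distinct.
assert not np.allclose(grounded["<sid_17>"], grounded["<sid_42>"])
```

This requires no extra training data beyond the token-description pairs and runs as a one-shot step before fine-tuning, which is what makes the grounding stage lightweight.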
Results
GTI outperforms both mean initialization and existing auxiliary-task adaptation methods across multiple generative recommendation benchmarks, including:
- Industry-scale datasets
- Public benchmark datasets
Analysis confirms grounded embeddings produce richer inter-token structure that persists through fine-tuning.
Why This Matters
- Generative recommendation: Better Semantic-ID token representations
- Domain adaptation: Any scenario requiring vocabulary extension
- Efficiency: Lightweight pre-finetuning step with outsized impact
- Theory: Provides diagnostic tools for understanding embedding quality
Paper: arXiv:2604.02324