Grounded Token Initialization: Fixing a Key Bottleneck in Extending Language Models with New Vocabulary

2026-04-04T00:40:38.667Z·1 min read
A new study reveals that the standard practice of initializing new vocabulary tokens as the mean of existing embeddings creates a critical bottleneck when extending language models for domain-specific tasks.

The Problem

When extending LMs with new tokens (e.g., Semantic-ID tokens for recommendation), the standard approach:

  1. Initializes new tokens as the mean of existing vocabulary embeddings
  2. Relies on fine-tuning to learn representations

Through spectral and geometric diagnostics, the researchers show this collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that fine-tuning struggles to recover.
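The collapse described above is easy to reproduce with toy diagnostics. The sketch below (hypothetical sizes, not the paper's setup) shows that mean initialization makes every new token embedding identical, so pairwise distances are zero and the new-token block is rank 1, i.e. a degenerate subspace:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pretrained embedding table: 1000 existing tokens, dimension 64.
pretrained = rng.normal(size=(1000, 64))

# Standard practice: every new token starts as the mean of existing embeddings.
num_new = 8
mean_init = np.tile(pretrained.mean(axis=0), (num_new, 1))

# Geometric diagnostic: all pairwise distances between new tokens are exactly
# zero, so fine-tuning gets no initial signal to tell them apart.
pairwise = np.linalg.norm(mean_init[:, None] - mean_init[None, :], axis=-1)
print(pairwise.max())  # 0.0

# Spectral diagnostic: the block of new embeddings has rank 1.
print(np.linalg.matrix_rank(mean_init))  # 1
```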

The Solution: Grounded Token Initialization (GTI)

The paper proposes the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning enables the model to leverage its general-purpose knowledge.

GTI adds a lightweight grounding stage that initializes each new token in the pretrained embedding space before fine-tuning begins, so that distinct tokens start from distinct, semantically meaningful positions.
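The paper's exact grounding procedure is not reproduced here, but one plausible instantiation of the idea is to initialize each new token from the pretrained embeddings of existing tokens in its textual description, rather than from the global mean. A minimal sketch, with a hypothetical vocabulary and `ground_new_token` helper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pretrained vocabulary and embedding table (hypothetical names and sizes).
vocab = {w: i for i, w in enumerate(["action", "comedy", "film", "classic", "indie"])}
embeddings = rng.normal(size=(len(vocab), 16))

def ground_new_token(description_words, vocab, embeddings):
    """Initialize a new token as the mean of the pretrained embeddings of
    the existing tokens appearing in its textual description."""
    ids = [vocab[w] for w in description_words if w in vocab]
    return embeddings[ids].mean(axis=0)

# Two Semantic-ID tokens grounded in different descriptions start out distinct,
# whereas mean initialization would make them identical.
t1 = ground_new_token(["classic", "action", "film"], vocab, embeddings)
t2 = ground_new_token(["indie", "comedy", "film"], vocab, embeddings)
print(np.linalg.norm(t1 - t2) > 0)  # True
```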

Results

GTI outperforms both mean initialization and existing auxiliary-task adaptation methods across multiple generative recommendation benchmarks.

Analysis confirms grounded embeddings produce richer inter-token structure that persists through fine-tuning.

Why This Matters

Paper: arXiv:2604.02324

↗ Original source · 2026-04-03T00:00:00.000Z