ROTATE: Data-Free Method Disentangles LLM Neurons in Weight Space Using Vocabulary Kurtosis

2026-04-08

ROTATE: Understanding LLM Neurons Without Running the Model — A Breakthrough in Mechanistic Interpretability

Researchers have developed ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method that disentangles what individual neurons in LLMs encode by analyzing their weights directly, without any forward passes.

The Key Insight

Neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary: the scores over tokens are heavy-tailed, with a small set of tokens standing out from the rest. By optimizing rotations of neuron weights to maximize this kurtosis, ROTATE recovers interpretable "vocabulary channels."
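To make the kurtosis idea concrete, here is a minimal toy sketch, not the ROTATE implementation. It assumes that "projecting onto the vocabulary" means multiplying a weight-space direction by an unembedding-style matrix, and it swaps in a brute-force sweep over a single rotation angle in place of whatever optimizer the method actually uses. The matrix, the two planted "concepts," and all function names are hypothetical.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess (Fisher) kurtosis: mean of standardized values to the 4th power, minus 3."""
    x = np.asarray(x, dtype=np.float64)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def vocab_kurtosis(direction, unembed):
    """Kurtosis of a direction's scores over the vocabulary.

    unembed: (vocab_size, d_model) matrix, direction: (d_model,) vector.
    Assumption: this matrix product stands in for "projection onto the vocabulary".
    """
    return excess_kurtosis(unembed @ direction)

def make_toy_unembed(vocab_size=20000, d_model=16, group=200, bump=8.0, seed=0):
    """Random toy matrix with two planted concepts (hypothetical data)."""
    rng = np.random.default_rng(seed)
    unembed = rng.normal(size=(vocab_size, d_model))
    unembed[:group, 0] += bump            # concept A tokens load on axis 0
    unembed[group:2 * group, 1] += bump   # concept B tokens load on axis 1
    return unembed

def rotate_in_plane(theta, d_model):
    """Unit vector cos(theta) * e0 + sin(theta) * e1."""
    v = np.zeros(d_model)
    v[0], v[1] = np.cos(theta), np.sin(theta)
    return v

unembed = make_toy_unembed()
d_model = unembed.shape[1]

# Sweep the rotation angle in 1-degree steps from axis 0 to axis 1.
angles = np.linspace(0.0, np.pi / 2, 91)
scores = np.array(
    [vocab_kurtosis(rotate_in_plane(t, d_model), unembed) for t in angles]
)

mixed_kurtosis = float(scores[45])   # 45 degrees: equal mix of both concepts
best_angle = float(angles[np.argmax(scores)])
best_kurtosis = float(scores.max())
print(f"mixed: {mixed_kurtosis:.2f}  best: {best_kurtosis:.2f} "
      f"at {np.degrees(best_angle):.0f} deg")
```

In this toy setup the direction that mixes both concepts scores lower than a direction aligned with just one of them, which is the intuition behind using kurtosis as the objective for the rotation.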

What Makes ROTATE Special

Feature	Traditional Methods	ROTATE
Data required	Large datasets	None (data-free)
Forward passes	Yes	No
Computational cost	High	Low
Scalability	Limited	Scales to large models

Results

Tested on Llama-3.1-8B-Instruct and Gemma-2-2B-it:

Recovers vocabulary channels faithful to neuron behavior
Ablating individual channels selectively disables specific input activations or concepts
Aggregated channel descriptions outperform activation-based baselines by 2-3x in head-to-head comparisons
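The summary above does not say how a channel is ablated. One simple possibility, shown here purely as an illustrative assumption, is to project the channel's direction out of the neuron's weight vector and check which vocabulary scores change. The toy matrix, the two channels, and the `ablate_channel` helper are all hypothetical.

```python
import numpy as np

def ablate_channel(weight, channel):
    """Remove the component of `weight` along `channel` (orthogonal projection).

    Assumption: this is one plausible way to ablate a channel, not
    necessarily the procedure used by ROTATE.
    """
    unit = channel / np.linalg.norm(channel)
    return weight - (weight @ unit) * unit

rng = np.random.default_rng(0)
vocab_size, d_model, group = 5000, 16, 50

# Toy unembedding-style matrix with two planted concepts (hypothetical data).
unembed = rng.normal(size=(vocab_size, d_model))
unembed[:group, 0] += 8.0            # concept A tokens load on axis 0
unembed[group:2 * group, 1] += 8.0   # concept B tokens load on axis 1

channel_a = np.eye(d_model)[0]
channel_b = np.eye(d_model)[1]
neuron = channel_a + channel_b       # hypothetical neuron mixing both concepts

before = unembed @ neuron
after = unembed @ ablate_channel(neuron, channel_a)

# Mean vocabulary score for each concept's tokens, before and after ablation.
a_before, b_before = float(before[:group].mean()), float(before[group:2 * group].mean())
a_after, b_after = float(after[:group].mean()), float(after[group:2 * group].mean())
print(f"concept A: {a_before:.2f} -> {a_after:.2f}")
print(f"concept B: {b_before:.2f} -> {b_after:.2f}")
```

In this sketch, removing channel A collapses the scores of concept A's tokens while leaving concept B's largely intact, which mirrors the selective effect described in the results.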

Why This Matters

Mechanistic interpretability at scale: understanding 8B+ parameter models without expensive compute
Safety research: identifying what neurons encode can help detect dangerous capabilities
Model editing: knowing which channels control which concepts enables more precise modifications
No data needed: interpretability that requires no training data, which also avoids data-privacy concerns
