Darkness Visible: GPT-2's Final MLP Layer Decoded as a 27-Neuron Exception Handler
A remarkable mechanistic interpretability study has fully decoded the final MLP layer of GPT-2 Small into 27 named neurons organized as a three-tier exception handler. The research reveals how a tiny neural circuit routes knowledge without storing it.
The Discovery
The final MLP (multi-layer perceptron) in GPT-2 Small's last layer isn't storing knowledge — it's routing it. Of the layer's 3,072 neurons, the study attributes its behavior to just 27, decomposed as follows:
| Component | Count | Function |
|---|---|---|
| Core neurons | 5 | Reset vocabulary toward function words |
| Differentiators | 10 | Suppress wrong candidates |
| Specialists | 5 | Detect structural boundaries |
| Consensus neurons | 7 | Monitor distinct linguistic dimensions |
The Exception Handler Model
The neurons work as a three-tier system:
- Default path — Core neurons establish baseline function word usage
- Exception detection — Specialists identify when structural boundaries require different handling
- Consensus voting — seven neurons each monitor a distinct linguistic dimension; the crossover point (four or five of seven agreeing) sharply determines whether MLP intervention helps or harms
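The three-tier logic above can be sketched as a small decision function. This is purely illustrative: the function name `route`, the boolean vote encoding, and the threshold constant are assumptions for the sketch, not code from the study.

```python
CONSENSUS_THRESHOLD = 4  # the reported crossover sits at 4-5 of 7 dimensions

def route(consensus_votes, boundary_detected):
    """Illustrative sketch: decide whether the final MLP intervenes.

    consensus_votes: list of 7 booleans, one per monitored linguistic dimension.
    boundary_detected: True if a specialist neuron fired on a structural boundary.
    """
    if boundary_detected:
        # Exception path: specialists override the default handling
        return "exception"
    if sum(consensus_votes) >= CONSENSUS_THRESHOLD:
        # Majority of monitored dimensions favor MLP intervention
        return "intervene"
    # Default path: core neurons keep the function-word baseline
    return "default"

print(route([True] * 5 + [False] * 2, boundary_detected=False))  # intervene
```

The point of the sketch is the control flow, not the numbers: exceptions short-circuit the vote, and below the consensus threshold the layer falls back to its default behavior.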
Key Insight: Routing, Not Storage
The study challenges the popular "knowledge neurons" concept. The so-called knowledge neurons at layer 11 of GPT-2 function as routing infrastructure rather than fact storage. They amplify or suppress signals already present in the residual stream from attention layers.
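The amplify-or-suppress claim has a simple geometric reading: a routing neuron rescales a direction already present in the residual stream rather than writing a new one. A toy NumPy sketch (dimensions, gain, and the single read direction are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                          # toy residual-stream width

residual = rng.normal(size=d_model)  # signal written by earlier attention layers
read_dir = residual / np.linalg.norm(residual)  # direction the neuron reads

# A routing neuron rescales an existing direction instead of adding a new one:
gain = 1.5                           # >1 amplifies, <1 suppresses
mlp_out = (gain - 1.0) * np.dot(residual, read_dir) * read_dir
routed = residual + mlp_out

# The result stays parallel to the original signal: no new "fact" direction.
cos = np.dot(routed, residual) / (np.linalg.norm(routed) * np.linalg.norm(residual))
print(round(cos, 6))  # 1.0 (same direction, larger magnitude)
```

A fact-storing neuron would instead add a component orthogonal to the incoming signal; here the cosine similarity stays at 1, which is the signature of pure routing.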
Practical Implications
- Model editing — Understanding routing infrastructure enables more precise interventions
- Efficiency — The entire routing program uses only 27 neurons out of the layer's 3,072
- Garden-path reversal — Experiments show the system can dynamically reverse its processing direction