Attention Editing: Swap LLM Attention Architectures Without Retraining From Scratch
Researchers have developed Attention Editing, a framework that converts already-trained LLMs to new attention architectures — such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) — without expensive re-pretraining from scratch.
The Problem
KV cache memory and bandwidth increasingly dominate LLM inference costs, especially for long-context applications. New attention architectures like MLA and SWA can dramatically reduce this cost, but integrating them into existing models requires full retraining.
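To make the cost concrete, here is a back-of-envelope KV cache estimate. The configuration numbers below are illustrative GQA-style values chosen for the example, not the exact Qwen3-8B configuration:

```python
# Back-of-envelope KV cache size for a dense transformer with grouped-query
# attention. All configuration numbers are illustrative, not a real model's.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; one (head_dim)-sized entry per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Example: 36 layers, 8 KV heads, head_dim 128, 32k context, batch 1, fp16.
gib = kv_cache_bytes(36, 8, 128, 32_768, 1) / 2**30
print(f"{gib:.2f} GiB per sequence")  # 4.50 GiB per sequence
```

Even with grouped-query attention, a single long sequence can consume gigabytes of accelerator memory, which is why architectures that shrink or bound the cache (MLA, SWA) are attractive retrofits.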
Attention Editing's Solution
The framework replaces the original attention module with a learnable target and trains it using progressive distillation:
- Layer-wise teacher-forced optimization — trains each edited layer against the frozen teacher's intermediate activations, preventing cold-start error accumulation
- Model-level distillation — matches the teacher's next-token distributions end to end, optionally regularized by weak feature matching
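The two training signals above can be sketched as loss functions. These are assumed, generic forms (activation MSE and next-token KL divergence), not the paper's exact objectives:

```python
import math

# Sketch of the two distillation signals: intermediate-activation matching for
# the layer-wise stage, next-token KL distillation for the model-level stage.
# Assumed generic forms; the paper's exact losses may differ.

def layerwise_loss(student_hidden, teacher_hidden):
    # MSE between the edited layer's output and the frozen teacher's activation.
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_logits, eps=1e-9):
    # KL(teacher || student) over the next-token distribution.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))
```

In practice the layer-wise stage would be run first, layer by layer with teacher-forced inputs, before switching to the end-to-end KL objective.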
Practical Results
Applied to Qwen3-8B and Qwen3-30B-A3B with two target architectures:
| Target | Description |
|---|---|
| MLA | Multi-head Latent Attention (introduced by DeepSeek) |
| GateSWA | Gated hybrid sliding-window attention (novel design) |
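The sliding-window half of the second target limits each token's attention to a fixed-size local window, which bounds KV cache growth. A minimal illustration of that masking rule (the mechanism only, not the GateSWA design itself, whose gating is not specified here):

```python
# Minimal sliding-window causal mask: query position i may attend only to the
# previous `window` key positions, itself included. This illustrates the SWA
# mechanism generically; it is not the paper's GateSWA architecture.

def sliding_window_mask(seq_len, window):
    # mask[i][j] is True when query i is allowed to attend to key j.
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(5, 3)
# Row 4 attends to positions 2, 3, 4 only.
```

Because no position ever attends further back than `window` tokens, the KV cache for such layers can be truncated to a constant size regardless of sequence length.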
The resulting models maintain competitive performance while delivering substantial efficiency improvements.
Notable: Ascend 910B
Experiments were conducted on Huawei Ascend 910B clusters, providing a practical case study for training LLMs on domestic (Chinese) hardware — increasingly important given US export restrictions.
Why This Matters
- Cost reduction — Swap to efficient attention without retraining from scratch
- Flexibility — Upgrade existing models as new attention architectures emerge
- Hardware diversity — Demonstrates LLM training viability on non-NVIDIA hardware