Attention Editing: Swap LLM Attention Architectures Without Retraining From Scratch
Researchers have developed Attention Editing, a framework that converts already-trained LLMs to new attention architectures — such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) — without expensive re-pretraining from scratch.
The Problem
KV cache memory and bandwidth increasingly dominate LLM inference costs, especially for long-context applications. New attention architectures like MLA and SWA can dramatically reduce this cost, but integrating them into existing models requires full retraining.
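To make the cost concrete, here is a back-of-envelope KV cache estimate. The configuration numbers below are illustrative GQA-style values chosen for the example, not the exact Qwen3-8B configuration:

```python
# Back-of-envelope KV cache size for a dense transformer with grouped-query
# attention. All configuration numbers are illustrative, not a real model's.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; one (head_dim)-sized entry per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Example: 36 layers, 8 KV heads, head_dim 128, 32k context, batch 1, fp16.
gib = kv_cache_bytes(36, 8, 128, 32_768, 1) / 2**30
print(f"{gib:.2f} GiB per sequence")  # 4.50 GiB per sequence
```

Even with grouped-query attention, a single long sequence can consume gigabytes of accelerator memory, which is why architectures that shrink or bound the cache (MLA, SWA) are attractive retrofits.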
Attention Editing's Solution
The framework replaces the original attention module with a learnable target and trains it using progressive distillation:
- Layer-wise teacher-forced optimization — trains each edited layer against the frozen teacher's intermediate activations, preventing cold-start error accumulation
- Model-level distillation — matches the teacher's next-token distributions end to end, optionally regularized by weak feature matching
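The two training signals above can be sketched as loss functions. These are assumed, generic forms (activation MSE and next-token KL divergence), not the paper's exact objectives:

```python
import math

# Sketch of the two distillation signals: intermediate-activation matching for
# the layer-wise stage, next-token KL distillation for the model-level stage.
# Assumed generic forms; the paper's exact losses may differ.

def layerwise_loss(student_hidden, teacher_hidden):
    # MSE between the edited layer's output and the frozen teacher's activation.
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_logits, eps=1e-9):
    # KL(teacher || student) over the next-token distribution.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))
```

In practice the layer-wise stage would be run first, layer by layer with teacher-forced inputs, before switching to the end-to-end KL objective.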
Practical Results
Applied to Qwen3-8B and Qwen3-30B-A3B with two target architectures:
| Target | Description |
|---|---|
| MLA | Multi-head Latent Attention (introduced by DeepSeek) |
| GateSWA | Gated hybrid sliding-window attention (novel design) |
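The sliding-window half of the second target limits each token's attention to a fixed-size local window, which bounds KV cache growth. A minimal illustration of that masking rule (the mechanism only, not the GateSWA design itself, whose gating is not specified here):

```python
# Minimal sliding-window causal mask: query position i may attend only to the
# previous `window` key positions, itself included. This illustrates the SWA
# mechanism generically; it is not the paper's GateSWA architecture.

def sliding_window_mask(seq_len, window):
    # mask[i][j] is True when query i is allowed to attend to key j.
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(5, 3)
# Row 4 attends to positions 2, 3, 4 only.
```

Because no position ever attends further back than `window` tokens, the KV cache for such layers can be truncated to a constant size regardless of sequence length.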
The resulting models maintain competitive performance while delivering substantial efficiency improvements.
Notable: Ascend 910B
Experiments were conducted on Huawei Ascend 910B clusters, providing a practical case study for training LLMs on domestic (Chinese) hardware — increasingly important given US export restrictions.
Why This Matters
- Cost reduction — Swap to efficient attention without retraining from scratch
- Flexibility — Upgrade existing models as new attention architectures emerge
- Hardware diversity — Demonstrates LLM training viability on non-NVIDIA hardware