MUXQ: New Quantization Method Solves LLM Activation Outlier Problem for NPU Deployment
Quantizing LLMs to integer precision is essential for on-device deployment on NPUs (Neural Processing Units), but extreme activation outliers degrade the accuracy of existing methods. MUXQ introduces a low-rank outlier decomposition that enables reliable INT quantization.
The Problem
NPU-based on-device environments require integer (INT) quantization, since FP16/FP32 arithmetic is inefficient on these accelerators. But existing methods (ZeroQuant, LLM.int8(), SmoothQuant) don't fully address:
- Input activation outliers — extreme values in certain channels
- Associated hardware inefficiencies
- Accuracy degradation when forcing all activations to low precision
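The outlier problem is easy to reproduce. The sketch below (a minimal illustration, not from the paper; the channel index and outlier scale are made up) shows how one extreme channel inflates the scale of symmetric per-tensor INT8 quantization, destroying resolution for every normal channel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Typical activations plus one outlier channel, as observed in LLM hidden states.
x = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)
x[:, 3] *= 100.0  # hypothetical outlier channel with ~100x magnitude

def quantize_per_tensor_int8(x):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale  # dequantized reconstruction

x_hat = quantize_per_tensor_int8(x)
err = np.abs(x - x_hat)

# The outlier channel dictates the scale, so normal channels lose resolution:
# each INT8 step now spans a range comparable to the normal values themselves.
normal_err = err[:, np.arange(64) != 3].mean()
print(f"mean abs error on non-outlier channels: {normal_err:.3f}")
```

Without the outlier channel, the same quantizer would have a step size roughly 100x smaller, which is why per-tensor INT schemes break down on raw LLM activations.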
MUXQ's Innovation
Mixed-to-Uniform Quantization detects outlier channels and introduces a small auxiliary matrix that:
- Redistributes outlier magnitudes across channels
- Alleviates the outlier problem without complex per-element handling
- Enables INT quantization even for activation outliers
- Preserves hardware-friendly computation structure — no custom kernels needed
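The article does not spell out MUXQ's auxiliary-matrix construction, but the general idea of redistributing outlier magnitude without changing the layer's output can be sketched with an orthogonal mixing matrix folded into the weights (here a Hadamard rotation as a stand-in; MUXQ's low-rank auxiliary matrix is built differently):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Activations with one hypothetical outlier channel, and a weight matrix.
x = rng.normal(0.0, 1.0, size=(128, d)).astype(np.float32)
x[:, 3] *= 100.0
w = rng.normal(0.0, 0.05, size=(d, d)).astype(np.float32)

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # orthonormal, so its inverse is its transpose

m = hadamard(d)   # stand-in for the auxiliary mixing transform
x_mixed = x @ m   # outlier magnitude is spread evenly across all channels
w_mixed = m.T @ w # fold the inverse into the weights, keeping the output exact

print("max |activation| before mixing:", np.abs(x).max())
print("max |activation| after mixing :", np.abs(x_mixed).max())
print("output unchanged:", np.allclose(x @ w, x_mixed @ w_mixed, atol=1e-2))
```

After mixing, the per-channel magnitudes are far flatter, so a uniform INT quantizer wastes much less range, while the matmul structure stays a plain dense GEMM, consistent with the "no custom kernels" claim.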
Results
Tested on GPT-2 at three scales (0.1B, 0.3B, 0.7B parameters) on WikiText-2:
- Consistently achieves lower perplexity than existing methods
- Maintains hardware-efficient computation structure
- Small auxiliary matrix adds minimal overhead
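To make the "minimal overhead" claim concrete, a back-of-envelope estimate for a low-rank auxiliary term stored as two rank-`r` factors against a `d x d` weight matrix (both `r` and `d` are hypothetical values, not figures from the paper):

```python
# Hypothetical hidden size and auxiliary rank; the paper's shapes are not given above.
d, r = 4096, 16
full_params = d * d      # parameters in the original d x d weight matrix
aux_params = 2 * d * r   # U (d x r) and V (r x d) low-rank factors
print(f"overhead: {100 * aux_params / full_params:.2f}%")  # → overhead: 0.78%
```

At these sizes the auxiliary factors add well under 1% extra storage and compute, which is why a low-rank formulation is attractive for memory-constrained NPU deployment.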
Why It Matters
Running LLMs on phones, tablets, and edge devices requires aggressive quantization. MUXQ's approach of redistributing rather than discarding outlier information could enable higher-quality on-device AI inference without hardware-specific optimizations.