MUXQ: New Quantization Method Solves LLM Activation Outlier Problem for NPU Deployment
Quantizing LLMs to integer precision is essential for on-device deployment on NPUs (Neural Processing Units), but extreme activation outliers degrade the accuracy of existing methods. MUXQ introduces a low-rank outlier decomposition that enables reliable INT quantization.
The Problem
NPU-based on-device environments require integer (INT) quantization, since FP16/FP32 arithmetic is inefficient on these accelerators. But existing methods (ZeroQuant, LLM.int8(), SmoothQuant) don't fully address:
- Input activation outliers — extreme values in certain channels
- Associated hardware inefficiencies
- Accuracy degradation when forcing all activations to low precision
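The outlier problem is easy to reproduce. The sketch below (a minimal illustration, not from the paper; the channel index and outlier scale are made up) shows how one extreme channel inflates the scale of symmetric per-tensor INT8 quantization, destroying resolution for every normal channel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Typical activations plus one outlier channel, as observed in LLM hidden states.
x = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)
x[:, 3] *= 100.0  # hypothetical outlier channel with ~100x magnitude

def quantize_per_tensor_int8(x):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale  # dequantized reconstruction

x_hat = quantize_per_tensor_int8(x)
err = np.abs(x - x_hat)

# The outlier channel dictates the scale, so normal channels lose resolution:
# each INT8 step now spans a range comparable to the normal values themselves.
normal_err = err[:, np.arange(64) != 3].mean()
print(f"mean abs error on non-outlier channels: {normal_err:.3f}")
```

Without the outlier channel, the same quantizer would have a step size roughly 100x smaller, which is why per-tensor INT schemes break down on raw LLM activations.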
MUXQ's Innovation
Mixed-to-Uniform Quantization detects outlier channels and introduces a small auxiliary matrix that:
- Redistributes outlier magnitudes across channels
- Alleviates the outlier problem without complex per-element handling
- Enables INT quantization even for activation outliers
- Preserves hardware-friendly computation structure — no custom kernels needed
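The article does not spell out MUXQ's auxiliary-matrix construction, but the general idea of redistributing outlier magnitude without changing the layer's output can be sketched with an orthogonal mixing matrix folded into the weights (here a Hadamard rotation as a stand-in; MUXQ's low-rank auxiliary matrix is built differently):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Activations with one hypothetical outlier channel, and a weight matrix.
x = rng.normal(0.0, 1.0, size=(128, d)).astype(np.float32)
x[:, 3] *= 100.0
w = rng.normal(0.0, 0.05, size=(d, d)).astype(np.float32)

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # orthonormal, so its inverse is its transpose

m = hadamard(d)   # stand-in for the auxiliary mixing transform
x_mixed = x @ m   # outlier magnitude is spread evenly across all channels
w_mixed = m.T @ w # fold the inverse into the weights, keeping the output exact

print("max |activation| before mixing:", np.abs(x).max())
print("max |activation| after mixing :", np.abs(x_mixed).max())
print("output unchanged:", np.allclose(x @ w, x_mixed @ w_mixed, atol=1e-2))
```

After mixing, the per-channel magnitudes are far flatter, so a uniform INT quantizer wastes much less range, while the matmul structure stays a plain dense GEMM, consistent with the "no custom kernels" claim.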
Results
Tested on GPT-2 at three scales (0.1B, 0.3B, 0.7B parameters) on WikiText-2:
- Consistently achieves lower perplexity than existing methods
- Maintains hardware-efficient computation structure
- Small auxiliary matrix adds minimal overhead
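To make the "minimal overhead" claim concrete, a back-of-envelope estimate for a low-rank auxiliary term stored as two rank-`r` factors against a `d x d` weight matrix (both `r` and `d` are hypothetical values, not figures from the paper):

```python
# Hypothetical hidden size and auxiliary rank; the paper's shapes are not given above.
d, r = 4096, 16
full_params = d * d      # parameters in the original d x d weight matrix
aux_params = 2 * d * r   # U (d x r) and V (r x d) low-rank factors
print(f"overhead: {100 * aux_params / full_params:.2f}%")  # → overhead: 0.78%
```

At these sizes the auxiliary factors add well under 1% extra storage and compute, which is why a low-rank formulation is attractive for memory-constrained NPU deployment.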
Why It Matters
Running LLMs on phones, tablets, and edge devices requires aggressive quantization. MUXQ's approach of redistributing rather than discarding outlier information could enable higher-quality on-device AI inference without hardware-specific optimizations.