MUXQ: New Quantization Method Solves LLM Activation Outlier Problem for NPU Deployment

Available in: 中文
2026-04-07T19:54:14.181Z·1 min read

Quantizing LLMs for on-device deployment on NPUs (Neural Processing Units) is essential, but activation outliers cause existing methods to fail. MUXQ introduces a novel approach using low-rank outlier decomposition to enable reliable INT quantization.

The Problem

NPU-based on-device environments require integer (INT) quantization, since FP16/FP32 inference is inefficient on this hardware. But existing methods (ZeroQuant, LLM.int8(), SmoothQuant) do not fully address activation outliers: a handful of channels with extreme magnitudes inflate the quantization scale, wasting precision on all the remaining values.
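To see why outliers are so damaging, here is a minimal NumPy sketch (synthetic data, not from the paper): one outlier channel blows up the scale of symmetric per-tensor INT8 quantization, and the reconstruction error for the whole tensor rises by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations: 7 "normal" channels plus 1 outlier channel,
# mimicking the activation-outlier pattern observed in LLMs.
x = rng.normal(0.0, 1.0, size=(128, 8))
x[:, 3] *= 50.0  # outlier channel dominates the dynamic range

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantize-dequantize round trip."""
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127)
    return q * scale

err_with_outlier = np.mean((x - quantize_int8(x)) ** 2)

# Same data with the outlier channel removed: the scale is no longer
# inflated by one channel, and the quantization error collapses.
x_no = np.delete(x, 3, axis=1)
err_without = np.mean((x_no - quantize_int8(x_no)) ** 2)

print(err_with_outlier / err_without)  # error ratio far above 10x
```

This is the failure mode the listed methods only partially mitigate: the scale is set by the worst channel, not the typical one.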

MUXQ's Innovation

Mixed-to-Uniform Quantization detects outlier channels and introduces a small auxiliary matrix that:

  1. Redistributes outlier magnitudes across channels
  2. Alleviates the outlier problem without complex per-element handling
  3. Enables INT quantization even for activation outliers
  4. Preserves hardware-friendly computation structure — no custom kernels needed
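The paper does not spell out the auxiliary matrix here, so as an illustration of the redistribution idea only: the sketch below uses an orthogonal Hadamard rotation (a technique known from other quantization work, not necessarily MUXQ's exact construction) to mix outlier energy across channels, folding the inverse into the weights so the layer output is mathematically unchanged.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(128, 8))
x[:, 3] *= 50.0                         # activation outlier channel
W = rng.normal(0.0, 0.1, size=(8, 4))   # toy linear layer

H = hadamard(8)                         # orthogonal: H @ H.T == I
x_mix = x @ H                           # outlier energy spread across channels
W_mix = H.T @ W                         # fold H^{-1} = H.T into the weights

# The matrix product is unchanged, so the layer output is preserved:
assert np.allclose(x_mix @ W_mix, x @ W)

# ...but the mixed activations have a much smaller peak magnitude,
# so a symmetric per-tensor INT scale wastes far less precision.
print(np.abs(x).max())      # dominated by the outlier channel
print(np.abs(x_mix).max())  # roughly sqrt(8)x smaller peak
```

Because the mixing is a plain dense matrix multiply folded into existing GEMMs, it preserves the hardware-friendly structure the list above describes: no per-element special-casing, no custom kernels.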

Results

MUXQ was evaluated on GPT-2 at three scales (0.1B, 0.3B, and 0.7B parameters) using WikiText-2.

Why It Matters

Running LLMs on phones, tablets, and edge devices requires aggressive quantization. MUXQ's approach of redistributing rather than discarding outlier information could enable higher-quality on-device AI inference without hardware-specific optimizations.

↗ Original source · 2026-04-07T00:00:00.000Z