REAM: Merging Instead of Pruning Mixture-of-Experts Preserves Performance While Cutting Memory
A new technique called REAM (Router-weighted Expert Activation Merging) challenges the conventional approach of pruning experts in Mixture-of-Experts (MoE) large language models. Instead of removing experts entirely, REAM groups and merges their weights, better preserving original model performance.
The MoE Problem
Mixture-of-Experts models like Mixtral and DeepSeek are among the top-performing LLM architectures, but with tens to hundreds of billions of total parameters, they pose serious memory challenges:
- Deployment cost — Full MoE models require multiple GPUs
- Inference latency — Large parameter count slows generation
- Traditional mitigations — Weight pruning and quantization help, but both can sacrifice quality
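To make the memory pressure concrete, here is a back-of-the-envelope sketch of checkpoint size versus precision. The parameter count is Mixtral-8x7B's publicly stated total (~46.7B); the 0.55 compression factor is purely a hypothetical illustration of what expert merging might retain, not a number from the REAM paper.

```python
def checkpoint_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GiB at a given precision (default fp16)."""
    return n_params * bytes_per_param / 2**30

full = checkpoint_gib(46.7e9)            # Mixtral-8x7B total params, fp16: ~87 GiB
merged = checkpoint_gib(46.7e9 * 0.55)   # hypothetical post-merging footprint
```

Even at fp16, the full model exceeds a single 80 GB GPU, which is why compression of the expert weights (the bulk of the parameters) is the natural target.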
REAM's Innovation
Previous work (REAP) pruned experts — removing them entirely. REAM takes a different approach:
- Group similar experts — Find experts that activate on similar inputs
- Merge their weights — Combine expert parameters weighted by router scores
- Preserve knowledge — No expert's parameters are discarded outright; they are folded into the merged expert
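The three steps above can be sketched as follows. This is a minimal illustration of router-weighted merging, not the paper's implementation: the greedy similarity grouping, the 0.9 threshold, and the use of per-expert router activation mass as merge weights are all assumptions made for the example.

```python
import numpy as np

def group_by_activation_similarity(acts, threshold=0.9):
    """Greedily group experts whose calibration activation patterns are similar.

    acts[i]: vector of router probabilities for expert i over calibration tokens
             (hypothetical representation for this sketch).
    """
    groups = []
    for i, a in enumerate(acts):
        for g in groups:
            rep = acts[g[0]]  # compare against the group's first member
            cos = a @ rep / (np.linalg.norm(a) * np.linalg.norm(rep) + 1e-12)
            if cos > threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

def merge_expert_group(expert_weights, router_scores):
    """Router-weighted average of one group's expert weight matrices.

    router_scores: each expert's total router activation mass on calibration
    data, normalized here to form a convex combination of the weights.
    """
    w = np.asarray(router_scores, dtype=np.float64)
    w = w / w.sum()
    return sum(wi * E for wi, E in zip(w, expert_weights))
```

Experts that the router rarely selects thus contribute proportionally less to the merged weights, which is the intuition behind weighting by router scores rather than averaging uniformly.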
The Key Finding: MC vs GEN Tradeoff
The research reveals an important trade-off between:
- Multiple choice performance — Question answering accuracy
- Generative performance — Open-ended text generation quality
The balance depends on the calibration data mix. By controlling the ratio of general, math, and coding data used for calibration, REAM navigates this Pareto frontier effectively.
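Controlling that ratio amounts to sampling the calibration set from domain pools. The helper below is a hypothetical illustration of such a mixing step; the pool names, ratios, and sampling scheme are assumptions, and the paper's exact procedure may differ.

```python
import random

def build_calibration_mix(pools, ratios, n_samples, seed=0):
    """Sample a calibration set from domain pools according to given ratios.

    pools:  dict mapping domain name -> list of calibration texts
    ratios: dict mapping domain name -> fraction of the final set
    """
    rng = random.Random(seed)  # fixed seed for a reproducible calibration set
    mix = []
    for domain, r in ratios.items():
        k = round(r * n_samples)
        mix.extend(rng.choices(pools[domain], k=k))
    rng.shuffle(mix)
    return mix
```

Shifting the ratios toward math and code would then be expected to favor generative benchmarks in those domains, at some cost to multiple-choice accuracy, per the trade-off described above.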
Results
- Often outperforms pruning baselines (REAP and others)
- In many cases comparable to original uncompressed models
- Significant memory reduction achieved
- Open source code available
Why This Matters
MoE models are becoming the dominant architecture for frontier LLMs. Better compression techniques like REAM could make these models practical for deployment on consumer hardware and edge devices.