REAM: Merging Instead of Pruning Mixture-of-Experts Preserves Performance While Cutting Memory
A new technique called REAM (Router-weighted Expert Activation Merging) challenges the conventional approach of pruning experts in Mixture-of-Experts (MoE) large language models. Instead of removing experts entirely, REAM groups and merges their weights, better preserving original model performance.
The MoE Problem
Mixture-of-Experts models like Mixtral and DeepSeek are among the top-performing LLM architectures, but with tens to hundreds of billions of total parameters, they pose serious memory challenges:
- Deployment cost — Full MoE models require multiple GPUs
- Inference latency — Large parameter count slows generation
- Traditional mitigations — Weight pruning and quantization help, but both can sacrifice quality
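To make the memory pressure concrete, here is a back-of-the-envelope sketch of checkpoint size versus precision. The parameter count is Mixtral-8x7B's publicly stated total (~46.7B); the 0.55 compression factor is purely a hypothetical illustration of what expert merging might retain, not a number from the REAM paper.

```python
def checkpoint_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GiB at a given precision (default fp16)."""
    return n_params * bytes_per_param / 2**30

full = checkpoint_gib(46.7e9)            # Mixtral-8x7B total params, fp16: ~87 GiB
merged = checkpoint_gib(46.7e9 * 0.55)   # hypothetical post-merging footprint
```

Even at fp16, the full model exceeds a single 80 GB GPU, which is why compression of the expert weights (the bulk of the parameters) is the natural target.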
REAM's Innovation
Previous work (REAP) pruned experts — removing them entirely. REAM takes a different approach:
- Group similar experts — Find experts that activate on similar inputs
- Merge their weights — Combine expert parameters weighted by router scores
- Preserve knowledge — No expert's parameters are discarded outright; they are folded into the merged expert
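The three steps above can be sketched as follows. This is a minimal illustration of router-weighted merging, not the paper's implementation: the greedy similarity grouping, the 0.9 threshold, and the use of per-expert router activation mass as merge weights are all assumptions made for the example.

```python
import numpy as np

def group_by_activation_similarity(acts, threshold=0.9):
    """Greedily group experts whose calibration activation patterns are similar.

    acts[i]: vector of router probabilities for expert i over calibration tokens
             (hypothetical representation for this sketch).
    """
    groups = []
    for i, a in enumerate(acts):
        for g in groups:
            rep = acts[g[0]]  # compare against the group's first member
            cos = a @ rep / (np.linalg.norm(a) * np.linalg.norm(rep) + 1e-12)
            if cos > threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

def merge_expert_group(expert_weights, router_scores):
    """Router-weighted average of one group's expert weight matrices.

    router_scores: each expert's total router activation mass on calibration
    data, normalized here to form a convex combination of the weights.
    """
    w = np.asarray(router_scores, dtype=np.float64)
    w = w / w.sum()
    return sum(wi * E for wi, E in zip(w, expert_weights))
```

Experts that the router rarely selects thus contribute proportionally less to the merged weights, which is the intuition behind weighting by router scores rather than averaging uniformly.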
The Key Finding: MC vs GEN Tradeoff
The research reveals an important trade-off between:
- Multiple choice performance — Question answering accuracy
- Generative performance — Open-ended text generation quality
The balance depends on the calibration data mix. By controlling the ratio of general, math, and coding data used for calibration, REAM navigates this Pareto frontier effectively.
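Controlling that ratio amounts to sampling the calibration set from domain pools. The helper below is a hypothetical illustration of such a mixing step; the pool names, ratios, and sampling scheme are assumptions, and the paper's exact procedure may differ.

```python
import random

def build_calibration_mix(pools, ratios, n_samples, seed=0):
    """Sample a calibration set from domain pools according to given ratios.

    pools:  dict mapping domain name -> list of calibration texts
    ratios: dict mapping domain name -> fraction of the final set
    """
    rng = random.Random(seed)  # fixed seed for a reproducible calibration set
    mix = []
    for domain, r in ratios.items():
        k = round(r * n_samples)
        mix.extend(rng.choices(pools[domain], k=k))
    rng.shuffle(mix)
    return mix
```

Shifting the ratios toward math and code would then be expected to favor generative benchmarks in those domains, at some cost to multiple-choice accuracy, per the trade-off described above.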
Results
- Often outperforms pruning baselines (REAP and others)
- In many cases comparable to original uncompressed models
- Significant memory reduction achieved
- Open source code available
Why This Matters
MoE models are becoming the dominant architecture for frontier LLMs. Better compression techniques like REAM could make these models practical for deployment on consumer hardware and edge devices.