REAM: Merging Instead of Pruning Mixture-of-Experts Preserves Performance While Cutting Memory

2026-04-07 · 1 min read
A new technique called REAM (Router-weighted Expert Activation Merging) challenges the conventional approach of pruning experts in Mixture-of-Experts (MoE) large language models. Instead of removing experts entirely, REAM groups and merges their weights, better preserving original model performance.

The MoE Problem

Mixture-of-Experts models like Mixtral and DeepSeek are among the top-performing LLM architectures, but with hundreds of billions of parameters, they pose massive memory challenges for deployment.

REAM's Innovation

Previous work (REAP) pruned experts, removing them from the model entirely. REAM takes a different approach:

  1. Group similar experts — Find experts that activate on similar inputs
  2. Merge their weights — Combine expert parameters weighted by router scores
  3. Preserve knowledge — No information is thrown away, just compressed
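The three steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: expert weights are treated as plain matrices, weight cosine similarity stands in for activation similarity on calibration inputs, and `router_scores` is assumed to be each expert's mean router probability over a calibration set.

```python
import numpy as np

def merge_experts(expert_weights, router_scores, num_groups):
    """REAM-style sketch (hypothetical shapes and names).

    expert_weights: (E, d_out, d_in) array, one weight matrix per expert
    router_scores:  (E,) mean router probability per expert on calibration data
    num_groups:     number of merged experts to keep
    """
    E = expert_weights.shape[0]
    # 1. Group similar experts: cosine similarity of flattened weights
    #    (a stand-in for "activates on similar inputs").
    flat = expert_weights.reshape(E, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T

    # Greedy grouping: seed groups with the highest-scoring experts,
    # then assign every expert to its most similar seed.
    seeds = np.argsort(router_scores)[-num_groups:]
    assign = seeds[np.argmax(sim[:, seeds], axis=1)]

    # 2. Merge weights within each group, weighted by router scores,
    #    so no expert's parameters are simply discarded.
    merged = []
    for s in seeds:
        members = np.where(assign == s)[0]
        w = router_scores[members] / router_scores[members].sum()
        merged.append(np.einsum("e,eij->ij", w, expert_weights[members]))
    return np.stack(merged)  # (num_groups, d_out, d_in)

# Toy usage: 8 experts with (4, 3) weight matrices merged down to 2
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3))
scores = rng.uniform(0.1, 1.0, size=8)
print(merge_experts(W, scores, num_groups=2).shape)
```

A real implementation would compute similarity from expert activations on calibration data and merge each expert's up-, gate-, and down-projection matrices jointly; the router-weighted average is the part specific to REAM.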

The Key Finding: MC vs GEN Tradeoff

The research reveals a trade-off between multiple-choice (MC) benchmark accuracy and open-ended generation (GEN) quality: merging choices that favor one tend to hurt the other.

Where a merged model lands on this trade-off depends on the calibration data mix. By controlling the ratio of general, math, and coding data used for calibration, REAM navigates this Pareto frontier effectively.
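To make the "calibration data mix" knob concrete, here is a hypothetical sketch of assembling a calibration set with a fixed general/math/coding ratio; the function name, pool arguments, and ratios are illustrative, not from the paper.

```python
import random

def build_calibration_mix(general, math, code, ratios, n, seed=0):
    """Sample n calibration examples with the given (general, math, code)
    ratio; shifting the ratio steers the MC-vs-GEN trade-off."""
    rng = random.Random(seed)
    mix = []
    for pool, r in zip((general, math, code), ratios):
        mix.extend(rng.choices(pool, k=round(n * r)))  # sample with replacement
    rng.shuffle(mix)
    return mix

# e.g. 50% general, 30% math, 20% coding data
calib = build_calibration_mix(["g1", "g2"], ["m1"], ["c1"],
                              ratios=(0.5, 0.3, 0.2), n=10)
print(len(calib))
```

Skewing `ratios` toward math and coding data would bias the merge toward GEN-style tasks, and toward general data for MC benchmarks, under the trade-off described above.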

Results

Why This Matters

MoE models are becoming the dominant architecture for frontier LLMs. Better compression techniques like REAM could make these models practical for deployment on consumer hardware and edge devices.

↗ Original source · 2026-04-07T00:00:00.000Z