# CVPR 2026: Mutual Pair Merging Cuts Vision Transformer Latency 60% on Raspberry Pi 5
Researchers have developed Mutual Pair Merging (MPM), a training-free token aggregation method for Vision Transformers that reduces inference latency by up to 60% on edge devices while keeping accuracy loss below 3% mIoU.
## The Problem
Vision Transformers process images as long sequences of patch tokens, and self-attention cost grows quadratically with the number of tokens. Existing token reduction methods face two issues:
- They target classification, not dense prediction (segmentation)
- On modern accelerators, computing merge maps can erase expected speed gains
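To see why token reduction pays off, a rough back-of-the-envelope sketch helps: the two attention matmuls (QKᵀ and AV) each cost about N²·D multiply-adds, so halving the token count roughly quarters attention FLOPs. The token count (196) and embedding dimension (192) below are the standard ViT-Tiny values for 224×224 inputs with 16×16 patches; the formula is a simplification that ignores projections and the MLP.

```python
def attn_flops(n_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for one self-attention layer.

    Counts only the two N x N matmuls (QK^T and AV), each ~N^2 * dim;
    this deliberately ignores the linear projections and MLP.
    """
    return 2 * n_tokens**2 * dim


full = attn_flops(196, 192)  # ViT-Tiny: 14x14 = 196 patch tokens, dim 192
half = attn_flops(98, 192)   # same layer after merging away half the tokens
print(full / half)  # → 4.0: halving tokens quarters attention cost
```

This quadratic scaling is also why the overhead of *computing* the merge map matters: on fast accelerators, a merge map that is expensive to build can eat the savings it buys.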
## How MPM Works
- Form mutual nearest-neighbor pairs in cosine space
- Average each pair to reduce token count
- Record merge maps for reconstruction before the decoder
- Insert at discrete layers — no continuous compression knob needed
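The pairing, averaging, and merge-map steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the greedy left-to-right tie handling, and the equal-weight average are assumptions.

```python
import numpy as np


def mutual_pair_merge(tokens: np.ndarray):
    """Merge mutual nearest-neighbor token pairs (illustrative sketch).

    tokens: (N, D) array of token embeddings.
    Returns the reduced token array and a merge map for reconstruction.
    """
    # Cosine similarity between all tokens.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is not its own neighbor

    nn = sim.argmax(axis=1)  # nearest neighbor of each token
    merged, merge_map, used = [], [], set()
    for i in range(len(tokens)):
        if i in used:
            continue
        j = int(nn[i])
        if nn[j] == i and j not in used:  # mutual nearest neighbors
            merged.append((tokens[i] + tokens[j]) / 2)  # average the pair
            merge_map.append((i, j))
            used.update((i, j))
        else:  # unpaired tokens pass through unchanged
            merged.append(tokens[i])
            merge_map.append((i,))
            used.add(i)
    return np.stack(merged), merge_map


def unmerge(merged: np.ndarray, merge_map, n: int) -> np.ndarray:
    """Scatter merged tokens back to their original positions,
    restoring the full-length sequence before the decoder."""
    out = np.zeros((n, merged.shape[1]), dtype=merged.dtype)
    for row, idxs in zip(merged, merge_map):
        for i in idxs:
            out[i] = row
    return out
```

Because merging only ever pairs tokens (rather than forming arbitrary clusters), each pass removes at most half the tokens; stacking the operation at a few discrete layers stands in for a continuous compression knob.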
## Key Results
| Aspect | Metric | Result |
|---|---|---|
| Raspberry Pi 5 | Per-image latency (ViT-Tiny, ADE20K) | -60% |
| NVIDIA H100 | Throughput (w/ FlashAttention-2) | +20% |
| Accuracy | mIoU drop | < 3% |
| Parameters | New learned parameters | 0 |
## Why This Matters
- Edge AI — Makes ViT-based segmentation practical on $80 single-board computers
- No retraining — Works with existing models, no fine-tuning required
- Overhead-aware — Explicitly accounts for merge map computation cost (a common oversight)
- CVPR 2026 — Accepted to the conference's Findings track, a mark of peer recognition