# CVPR 2026: Mutual Pair Merging Cuts Vision Transformer Latency 60% on Raspberry Pi 5
Researchers have developed Mutual Pair Merging (MPM), a training-free token aggregation method for Vision Transformers that reduces inference latency by up to 60% on edge devices while keeping accuracy loss below 3% mIoU.
## The Problem
Vision Transformers process images as long sequences of patch tokens, and self-attention cost grows quadratically with the number of tokens. Existing token reduction methods face two issues:
- They target classification, not dense prediction (segmentation)
- On modern accelerators, computing merge maps can erase expected speed gains
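To see why token reduction pays off, a rough back-of-the-envelope sketch helps: the two attention matmuls (QKᵀ and AV) each cost about N²·D multiply-adds, so halving the token count roughly quarters attention FLOPs. The token count (196) and embedding dimension (192) below are the standard ViT-Tiny values for 224×224 inputs with 16×16 patches; the formula is a simplification that ignores projections and the MLP.

```python
def attn_flops(n_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for one self-attention layer.

    Counts only the two N x N matmuls (QK^T and AV), each ~N^2 * dim;
    this deliberately ignores the linear projections and MLP.
    """
    return 2 * n_tokens**2 * dim


full = attn_flops(196, 192)  # ViT-Tiny: 14x14 = 196 patch tokens, dim 192
half = attn_flops(98, 192)   # same layer after merging away half the tokens
print(full / half)  # → 4.0: halving tokens quarters attention cost
```

This quadratic scaling is also why the overhead of *computing* the merge map matters: on fast accelerators, a merge map that is expensive to build can eat the savings it buys.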
## How MPM Works
- Form mutual nearest-neighbor pairs in cosine space
- Average each pair to reduce token count
- Record merge maps for reconstruction before the decoder
- Insert at discrete layers — no continuous compression knob needed
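The pairing, averaging, and merge-map steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the greedy left-to-right tie handling, and the equal-weight average are assumptions.

```python
import numpy as np


def mutual_pair_merge(tokens: np.ndarray):
    """Merge mutual nearest-neighbor token pairs (illustrative sketch).

    tokens: (N, D) array of token embeddings.
    Returns the reduced token array and a merge map for reconstruction.
    """
    # Cosine similarity between all tokens.
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is not its own neighbor

    nn = sim.argmax(axis=1)  # nearest neighbor of each token
    merged, merge_map, used = [], [], set()
    for i in range(len(tokens)):
        if i in used:
            continue
        j = int(nn[i])
        if nn[j] == i and j not in used:  # mutual nearest neighbors
            merged.append((tokens[i] + tokens[j]) / 2)  # average the pair
            merge_map.append((i, j))
            used.update((i, j))
        else:  # unpaired tokens pass through unchanged
            merged.append(tokens[i])
            merge_map.append((i,))
            used.add(i)
    return np.stack(merged), merge_map


def unmerge(merged: np.ndarray, merge_map, n: int) -> np.ndarray:
    """Scatter merged tokens back to their original positions,
    restoring the full-length sequence before the decoder."""
    out = np.zeros((n, merged.shape[1]), dtype=merged.dtype)
    for row, idxs in zip(merged, merge_map):
        for i in idxs:
            out[i] = row
    return out
```

Because merging only ever pairs tokens (rather than forming arbitrary clusters), each pass removes at most half the tokens; stacking the operation at a few discrete layers stands in for a continuous compression knob.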
## Key Results
| Aspect | Metric | Result |
|---|---|---|
| Raspberry Pi 5 | Per-image latency (ViT-Tiny, ADE20K) | -60% |
| NVIDIA H100 | Throughput (w/ FlashAttention-2) | +20% |
| Accuracy | mIoU drop | < 3% |
| Parameters | New learned parameters | 0 |
## Why This Matters
- Edge AI — Makes ViT-based segmentation practical on $80 single-board computers
- No retraining — Works with existing models, no fine-tuning required
- Overhead-aware — Explicitly accounts for merge map computation cost (a common oversight)
- CVPR 2026 — Accepted to the conference's Findings track, a mark of peer recognition