CVPR 2026: Mutual Pair Merging Cuts Vision Transformer Latency 60% on Raspberry Pi 5

2026-04-08 · 1 min read

MPM: Training-Free Token Merging Slashes Vision Transformer Inference Time by 60%

Researchers have developed Mutual Pair Merging (MPM), a training-free token aggregation method for Vision Transformers that reduces inference latency by up to 60% on edge devices while keeping accuracy loss below 3% mIoU.

The Problem

Vision Transformers process images as long sequences of tokens, and self-attention cost grows quadratically with sequence length, making inference expensive on constrained hardware. Existing token reduction methods face two issues: they typically require retraining or fine-tuning, and they ignore the runtime overhead of computing the reduction itself.

How MPM Works

  1. Form mutual nearest-neighbor pairs in cosine space
  2. Average each pair to reduce token count
  3. Record merge maps for reconstruction before the decoder
  4. Insert at discrete layers — no continuous compression knob needed
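The pairing, merging, and reconstruction steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the idea, not the authors' implementation: the function names (`mutual_pair_merge`, `unmerge`) are hypothetical, and a real system would apply this inside transformer layers rather than to a standalone array.

```python
import numpy as np

def mutual_pair_merge(tokens):
    """Merge mutually-nearest-neighbor token pairs in cosine space.

    tokens: (N, D) array. Returns (merged, merge_map), where
    merge_map[i] is the row of `merged` that original token i maps to,
    which is enough to reconstruct the sequence before the decoder.
    Illustrative sketch of the MPM idea, not the published code.
    """
    n = tokens.shape[0]
    # Step 1: cosine similarity between all tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)   # a token cannot pair with itself
    nn = sim.argmax(axis=1)          # nearest neighbor of each token

    # A pair (i, j) is mutual iff nn[i] == j and nn[j] == i.
    merge_map = np.full(n, -1, dtype=int)
    merged = []
    for i in range(n):
        if merge_map[i] != -1:       # already merged as someone's partner
            continue
        j = nn[i]
        if nn[j] == i:
            # Step 2: average the mutual pair into one token.
            merge_map[i] = merge_map[j] = len(merged)
            merged.append((tokens[i] + tokens[j]) / 2)
        else:
            merge_map[i] = len(merged)
            merged.append(tokens[i])
    # Step 3: merge_map is the recorded merge map.
    return np.stack(merged), merge_map

def unmerge(merged, merge_map):
    """Reconstruct a full-length token sequence before the decoder."""
    return merged[merge_map]
```

Because only mutual pairs merge, each pass removes at most half the tokens, and the merge map makes the operation invertible up to the pair averaging, which is what keeps dense prediction tasks like segmentation workable.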

Key Results

Device           Metric                                  Result
Raspberry Pi 5   Per-image latency (ViT-Tiny, ADE20K)    -60%
NVIDIA H100      Throughput (w/ FlashAttention-2)        +20%
—                Accuracy (mIoU drop)                    < 3%
—                New learned parameters                  0

Why This Matters

  1. Edge AI — Makes ViT-based segmentation practical on $80 single-board computers
  2. No retraining — Works with existing models, no fine-tuning required
  3. Overhead-aware — Explicitly accounts for merge map computation cost (a common oversight)
  4. CVPR 2026 — Accepted to Findings, showing peer recognition