ID-Selection: Prune 97% of Visual Tokens in Vision-Language Models While Keeping 92% of Performance

2026-04-08T09:19:42.499Z·1 min read


Less Is More: New Method Prunes 97% of Visual Tokens in LLaVA While Preserving Performance

Researchers have developed ID-Selection, a visual token selection strategy for Large Vision-Language Models (LVLMs) that prunes 97.2% of visual tokens, keeping only 16 of 576, while maintaining 91.8% of original performance, all without additional training.

The Problem

Vision-language models like GPT-4V and LLaVA process images as sequences of visual tokens. Processing every token is computationally expensive: a single image in the reported setup yields 576 of them. Existing pruning approaches face a trade-off:

Importance-based → Retains redundant similar tokens
Diversity-based → May discard informative tokens

ID-Selection's Innovation

The method couples importance with diversity in a unified process:

Score each visual token for importance
Select high-scoring tokens one by one
Progressively suppress the remaining tokens that are similar to ones already selected

This ensures both informativeness and diversity without either dominating.
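
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the article does not say how importance is scored or how suppression is computed, so the sketch takes importance scores as given and assumes a simple rule that multiplies each remaining score by (1 - cosine similarity) to the token just picked. The names select_tokens and cosine are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def select_tokens(features, importance, k):
    """Greedy importance-plus-diversity selection (illustrative sketch).

    features:   list of token embedding vectors
    importance: list of non-negative importance scores, one per token
    k:          number of tokens to keep
    Returns the indices of the kept tokens, in selection order.
    """
    scores = [float(s) for s in importance]
    remaining = set(range(len(scores)))
    selected = []
    while remaining and len(selected) < k:
        # Step 2: pick the highest-scoring remaining token (lowest index wins ties).
        best = max(remaining, key=lambda i: (scores[i], -i))
        selected.append(best)
        remaining.remove(best)
        # Step 3 (assumed rule): damp the scores of tokens similar to the pick.
        for i in remaining:
            sim = min(max(cosine(features[i], features[best]), 0.0), 1.0)
            scores[i] *= 1.0 - sim
    return selected

# Tokens 0 and 1 are near-duplicates; token 2 is distinct but less important.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
importance = [1.0, 0.9, 0.5]
print(select_tokens(features, importance, k=2))  # [0, 2]
```

A plain top-2 by importance would keep tokens 0 and 1, which carry nearly the same information. With suppression, the second slot goes to the distinct token 2 instead, which is the behavior the coupled process is meant to produce.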

Results

Metric	Value
Tokens pruned	97.2% (576→16)
Inference FLOPs reduction	>97%
Performance retained	91.8%
Additional training required	None
Tested on	5 LVLM backbones, 16 benchmarks
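
The pruning percentage follows directly from the token counts in the table. The short check below uses only the 576 and 16 figures reported above. The resulting ratio describes the length of the visual token sequence only; it is not a measured end-to-end speedup, which would also depend on text tokens and the vision encoder.

```python
original_tokens = 576  # visual tokens before pruning (from the table)
kept_tokens = 16       # visual tokens after pruning (from the table)

pruned_fraction = 1 - kept_tokens / original_tokens
compression = original_tokens / kept_tokens

print(f"Tokens pruned: {pruned_fraction:.1%}")         # Tokens pruned: 97.2%
print(f"Visual sequence is {compression:.0f}x shorter")  # Visual sequence is 36x shorter
```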

Why This Matters

Cost reduction — Vision-language model inference can be 30x+ cheaper
Edge deployment — Makes powerful LVLMs feasible on mobile and edge devices
Speed improvement — Near-real-time vision understanding becomes practical
No retraining — Works with existing models out of the box
Energy efficiency — Critical for datacenter-scale vision AI

↗ Original source · 2026-04-08T00:00:00.000Z
