Don't Blink: Vision-Language Models Can Become More Accurate While Losing Visual Grounding — "Evidence Collapse"
A new paper reveals a dangerous failure mode in multimodal AI systems: as reasoning vision-language models (VLMs) think through problems, they can become more accurate in their final answers while progressively losing their visual grounding — essentially "forgetting" what they are looking at.
The Discovery
Researchers found that reasoning VLMs exhibit a "pervasive evidence-collapse phenomenon":
- Attention drifts away from evidence — As reasoning unfolds, attention to annotated evidence regions drops substantially, often losing more than half of its mass on the evidence
- Confidence increases while grounding decreases — The model becomes more confident even as it pays less attention to the visual information
- Low-entropy danger zone — Predictions that appear confident (low entropy) may actually be ungrounded
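The two quantities behind these findings can be sketched concretely. The snippet below is a minimal illustration (not the paper's implementation): it assumes access to a per-step visual attention map over image patches and a binary mask for the annotated evidence region, and measures the fraction of attention mass falling on the evidence plus the entropy of the answer distribution. All names and the toy decay schedule are hypothetical.

```python
import numpy as np

def evidence_attention_mass(attn_map, evidence_mask):
    """Fraction of a reasoning step's visual attention that falls on the
    annotated evidence region. Both arguments are 2-D arrays over image
    patches; the mask is binary (1 = evidence patch)."""
    total = attn_map.sum()
    return float((attn_map * evidence_mask).sum() / total) if total > 0 else 0.0

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution; low entropy
    is what makes a prediction *look* confident."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy trajectory: attention starts concentrated on the evidence region,
# then drifts toward the rest of the image as reasoning steps accumulate.
rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:4, 2:4] = 1.0  # hypothetical evidence region
masses = []
for step in range(5):
    attn = rng.random((8, 8))
    attn[2:4, 2:4] += max(0.0, 4.0 - step)  # evidence focus decays per step
    attn /= attn.sum()
    masses.append(evidence_attention_mass(attn, mask))

# "Evidence collapse": mass on the evidence region falls as reasoning unfolds,
# even though nothing stops the answer distribution from growing sharper.
assert masses[0] > masses[-1]
```

Note that nothing couples the two measurements: entropy can fall (confidence rises) while evidence mass also falls, which is exactly the low-entropy danger zone described above.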
Why This Is Dangerous
This creates "task-conditional danger zones" where:
- The model produces a confident answer
- The answer happens to be correct (by reasoning or luck)
- But the reasoning process has disconnected from the visual evidence
- Text-only monitoring cannot detect this failure mode
On sustained visual-reference tasks (like medical imaging or document analysis), this could lead to confident but ungrounded diagnoses or decisions.
The Fix
The researchers developed a targeted "vision veto" mechanism that:
- Detects when visual engagement drops below a threshold
- Reduces selective risk by up to 1.9 percentage points at 90% coverage
- Avoids degrading performance on tasks where visual disengagement is harmless (e.g., symbolic reasoning)
Broader Implications
This research suggests that simply making AI models "think longer" (chain-of-thought reasoning) can backfire for multimodal tasks — the model may reason its way away from the very evidence it should be using. For applications like medical imaging analysis, autonomous driving, or document verification, this is a critical safety concern.