Don't Blink: Vision-Language Models Can Become More Accurate While Losing Visual Grounding — "Evidence Collapse"
A new paper reveals a dangerous failure mode in multimodal AI systems: as reasoning vision-language models (VLMs) think through problems, they can become more accurate in their final answers while progressively losing their visual grounding — essentially "forgetting" what they are looking at.
The Discovery
Researchers found that reasoning VLMs exhibit a "pervasive evidence-collapse phenomenon":
- Attention drifts away from evidence — As reasoning unfolds, attention to annotated evidence regions drops substantially, often losing more than half of its mass on the evidence
- Confidence increases while grounding decreases — The model becomes more confident even as it pays less attention to the visual information
- Low-entropy danger zone — Predictions that appear confident (low entropy) may actually be ungrounded
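The two quantities behind these findings can be sketched concretely. The snippet below is a minimal illustration (not the paper's implementation): it assumes access to a per-step visual attention map over image patches and a binary mask for the annotated evidence region, and measures the fraction of attention mass falling on the evidence plus the entropy of the answer distribution. All names and the toy decay schedule are hypothetical.

```python
import numpy as np

def evidence_attention_mass(attn_map, evidence_mask):
    """Fraction of a reasoning step's visual attention that falls on the
    annotated evidence region. Both arguments are 2-D arrays over image
    patches; the mask is binary (1 = evidence patch)."""
    total = attn_map.sum()
    return float((attn_map * evidence_mask).sum() / total) if total > 0 else 0.0

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution; low entropy
    is what makes a prediction *look* confident."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy trajectory: attention starts concentrated on the evidence region,
# then drifts toward the rest of the image as reasoning steps accumulate.
rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:4, 2:4] = 1.0  # hypothetical evidence region
masses = []
for step in range(5):
    attn = rng.random((8, 8))
    attn[2:4, 2:4] += max(0.0, 4.0 - step)  # evidence focus decays per step
    attn /= attn.sum()
    masses.append(evidence_attention_mass(attn, mask))

# "Evidence collapse": mass on the evidence region falls as reasoning unfolds,
# even though nothing stops the answer distribution from growing sharper.
assert masses[0] > masses[-1]
```

Note that nothing couples the two measurements: entropy can fall (confidence rises) while evidence mass also falls, which is exactly the low-entropy danger zone described above.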
Why This Is Dangerous
This creates "task-conditional danger zones" where:
- The model produces a confident answer
- The answer happens to be correct (by reasoning or luck)
- But the reasoning process has disconnected from the visual evidence
- Text-only monitoring cannot detect this failure mode
On sustained visual-reference tasks (like medical imaging or document analysis), this could lead to confident but ungrounded diagnoses or decisions.
The Fix
The researchers developed a targeted "vision veto" mechanism that:
- Detects when visual engagement drops below a threshold
- Reduces selective risk by up to 1.9 percentage points at 90% coverage
- Avoids degrading performance on tasks where visual disengagement is harmless (e.g., symbolic reasoning)
Broader Implications
This research suggests that simply making AI models "think longer" (chain-of-thought reasoning) can backfire for multimodal tasks — the model may reason its way away from the very evidence it should be using. For applications like medical imaging analysis, autonomous driving, or document verification, this is a critical safety concern.