New Research: User Turn Generation as a Probe of Interaction Awareness in Language Models
Measuring What Benchmarks Miss: Does Your LLM Understand Conversations?
Standard LLM benchmarks evaluate only the assistant turn, so they never measure whether the model encodes any awareness of what follows its response. A new paper proposes user-turn generation as a probe for this gap.
The Key Finding
Across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets, researchers found that interaction awareness is decoupled from task accuracy. In the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B), yet genuine follow-up rates remain near zero under deterministic generation.
What This Means
- Current benchmarks measure only one dimension: can the model answer correctly?
- They miss whether the model understands the conversation as a two-way interaction
- Higher-temperature sampling reveals that interaction awareness is latent rather than absent, with follow-up rates reaching 22%
- The gap between task accuracy and interaction awareness widens with model size
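The probe behind these findings can be sketched as follows. This is a minimal illustration, not the paper's setup: the generation stub, the follow-up classifier heuristic, and the example conversation are all assumptions standing in for a real model call and the paper's actual judging procedure.

```python
import random

# Hypothetical stand-in for an LLM generation call. In practice this would
# sample a real model at the given temperature; here we hard-code behavior
# that mirrors the reported pattern (no follow-ups when deterministic,
# occasional follow-ups when sampling).
def generate_user_turn(conversation, temperature):
    if temperature == 0.0:
        return "Solve this new problem: ..."
    return random.choice([
        "Can you explain the second step in more detail?",  # genuine follow-up
        "Solve this new problem: ...",                       # task restart
    ])

def is_genuine_follow_up(user_turn):
    # Illustrative heuristic; the paper's actual classifier is not specified here.
    turn = user_turn.lower()
    return "explain" in turn or "step" in turn

def follow_up_rate(conversation, temperature, n_samples=200, seed=0):
    """Fraction of sampled next-user-turns that engage with the answer."""
    random.seed(seed)
    hits = sum(
        is_genuine_follow_up(generate_user_turn(conversation, temperature))
        for _ in range(n_samples)
    )
    return hits / n_samples

conv = [
    {"role": "user", "content": "A train travels 60 km in 1.5 h. Average speed?"},
    {"role": "assistant", "content": "60 / 1.5 = 40 km/h."},
]
print(follow_up_rate(conv, temperature=0.0))  # deterministic: 0.0
print(follow_up_rate(conv, temperature=1.0))  # sampled: nonzero
```

The point of the sketch is the measurement design: the same conversation prefix is continued as the *user*, and only the decoding temperature changes between the two rates.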
Collaboration-Oriented Post-Training
The researchers demonstrate that post-training specifically targeting collaboration increases follow-up rates, suggesting this dimension can be improved without sacrificing task performance.
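One plausible shape for such training data is sketched below; the format, field names, and example content are illustrative assumptions, not the paper's actual post-training recipe.

```python
# Hypothetical training example for collaboration-oriented post-training:
# the model is trained to continue the dialogue *as the user*, with a
# genuine follow-up as the target turn rather than an unrelated task restart.
example = {
    "messages": [
        {"role": "user", "content": "A train travels 60 km in 1.5 h. Average speed?"},
        {"role": "assistant", "content": "60 / 1.5 = 40 km/h."},
    ],
    # Target continuation: a user turn that engages with the answer above.
    "target_user_turn": "Can you show how the answer changes in mph?",
}

print(example["target_user_turn"])
```

Training on targets like this rewards continuations that treat the dialogue as a two-way interaction, which is the dimension the probe measures.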
Why It Matters
As LLMs are deployed as conversational agents, understanding whether they grasp the interactive nature of dialogue becomes critical. An agent that gives correct answers but cannot anticipate what a user might ask next is fundamentally limited.
arXiv: 2604.02315