New Research: User Turn Generation as a Probe of Interaction Awareness in Language Models
Measuring What Benchmarks Miss: Does Your LLM Understand Conversations?
Standard LLM benchmarks evaluate only the assistant turn, so they never measure whether the model encodes any awareness of what follows its response. A new paper proposes user-turn generation as a probe for this gap.
The Key Finding
Across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets, researchers found that interaction awareness is decoupled from task accuracy. In the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B), yet genuine follow-up rates remain near zero under deterministic generation.
What This Means
- Current benchmarks measure only one dimension: can the model answer correctly?
- They miss whether the model understands the conversation as a two-way interaction
- Higher-temperature sampling reveals that interaction awareness is latent rather than absent, with follow-up rates reaching 22%
- The gap between task accuracy and interaction awareness widens with model size
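The probe behind these findings can be sketched as follows. This is a minimal illustration, not the paper's setup: the generation stub, the follow-up classifier heuristic, and the example conversation are all assumptions standing in for a real model call and the paper's actual judging procedure.

```python
import random

# Hypothetical stand-in for an LLM generation call. In practice this would
# sample a real model at the given temperature; here we hard-code behavior
# that mirrors the reported pattern (no follow-ups when deterministic,
# occasional follow-ups when sampling).
def generate_user_turn(conversation, temperature):
    if temperature == 0.0:
        return "Solve this new problem: ..."
    return random.choice([
        "Can you explain the second step in more detail?",  # genuine follow-up
        "Solve this new problem: ...",                       # task restart
    ])

def is_genuine_follow_up(user_turn):
    # Illustrative heuristic; the paper's actual classifier is not specified here.
    turn = user_turn.lower()
    return "explain" in turn or "step" in turn

def follow_up_rate(conversation, temperature, n_samples=200, seed=0):
    """Fraction of sampled next-user-turns that engage with the answer."""
    random.seed(seed)
    hits = sum(
        is_genuine_follow_up(generate_user_turn(conversation, temperature))
        for _ in range(n_samples)
    )
    return hits / n_samples

conv = [
    {"role": "user", "content": "A train travels 60 km in 1.5 h. Average speed?"},
    {"role": "assistant", "content": "60 / 1.5 = 40 km/h."},
]
print(follow_up_rate(conv, temperature=0.0))  # deterministic: 0.0
print(follow_up_rate(conv, temperature=1.0))  # sampled: nonzero
```

The point of the sketch is the measurement design: the same conversation prefix is continued as the *user*, and only the decoding temperature changes between the two rates.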
Collaboration-Oriented Post-Training
The researchers demonstrate that post-training specifically targeting collaboration increases follow-up rates, suggesting this dimension can be improved without sacrificing task performance.
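One plausible shape for such training data is sketched below; the format, field names, and example content are illustrative assumptions, not the paper's actual post-training recipe.

```python
# Hypothetical training example for collaboration-oriented post-training:
# the model is trained to continue the dialogue *as the user*, with a
# genuine follow-up as the target turn rather than an unrelated task restart.
example = {
    "messages": [
        {"role": "user", "content": "A train travels 60 km in 1.5 h. Average speed?"},
        {"role": "assistant", "content": "60 / 1.5 = 40 km/h."},
    ],
    # Target continuation: a user turn that engages with the answer above.
    "target_user_turn": "Can you show how the answer changes in mph?",
}

print(example["target_user_turn"])
```

Training on targets like this rewards continuations that treat the dialogue as a two-way interaction, which is the dimension the probe measures.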
Why It Matters
As LLMs are deployed as conversational agents, understanding whether they grasp the interactive nature of dialogue becomes critical. An agent that gives correct answers but cannot anticipate what a user might ask next is fundamentally limited.
arXiv: 2604.02315