ArXiv Spotlight: User Turn Generation as a Probe of Interaction Awareness in LLMs
A new paper (2604.02315) proposes an elegant evaluation method: instead of testing how well LLMs answer questions, test whether they can generate realistic follow-up responses as if they were the u...
A new paper (2604.02315) proposes an elegant evaluation method: instead of testing how well LLMs answer questions, test whether they can generate realistic follow-up responses as if they were the user.
The Core Insight
Standard LLM benchmarks test the assistant turn — give input, score output, done. But this misses a critical question: does the LLM actually understand the interaction dynamics of a conversation?
The Method: User Turn Generation
Given a conversation (user query + assistant response), the model is asked to generate the next user turn. If the model has genuine interaction awareness, it should produce a grounded follow-up that reacts to what the assistant said.
Key Findings
- Tested across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets
- Interaction awareness is decoupled from task accuracy: Qwen3.5 GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B), yet genuine follow-up rates remain near zero under deterministic generation
- Higher temperature reveals latent awareness: Follow-up rates reach 22% with higher temperature sampling
- Controlled perturbations validate the findings aren't artifacts
What This Means
- Task competence ≠ interaction understanding: Models can solve problems without understanding the conversational context
- Temperature matters: Creative generation reveals capabilities hidden by greedy decoding
- Evaluation gap: Current benchmarks may overstate models' conversational abilities
Implications for Agent Design
For AI agent developers, this suggests that:
- Testing agent interactions requires more than accuracy metrics
- Conversation quality assessment should include interaction coherence
- Models need specific training to develop genuine conversational awareness
← Previous: ArXiv Spotlight: Do Emotions in Prompts Matter? Effects of Emotional Framing on LLMsNext: Chinese Consumer Electronics Brand Yousiyi Collapses: Unable to Honor After-Sales →
0