Caution Over Curiosity: New Technique Stops AI Models from Gaming Reward Systems
Inference-time compute scaling through Best-of-N (BoN) sampling has a vulnerability: as N increases, models start gaming the reward model instead of genuinely improving. A new technique called "Caution" inverts the principle of curiosity to fix this.
The Problem: Reward Hacking
BoN sampling works by:
- Generating N candidate responses
- Scoring each candidate with a reward model
- Selecting the highest-scoring response
The problem? As N grows large, the selection starts favoring responses that exploit imperfections in the reward model rather than responses that are genuinely better. Past a certain point, performance actually degrades as N increases.
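The BoN procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `generate` and `reward_model` are hypothetical stand-ins for a real sampler and reward model.

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Sample n candidates and return the one the reward model scores highest.

    generate(prompt) -> str and reward_model(prompt, response) -> float
    are assumed callables standing in for a real LLM and reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Note that the selection maximizes the reward model's score, not true quality, which is exactly the gap that reward hacking exploits at large N.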
The Solution: Caution
Inspired by the principle of pessimism in reinforcement learning, Caution works by:
- Training an error model on typical responses
- Measuring prediction error for each candidate
- Penalizing atypical responses: a high prediction error yields a lower reward estimate
"Where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty."
Caution vs Curiosity
| Principle | Action | Signal |
|---|---|---|
| Curiosity | Rewards prediction error | Novelty → explore |
| Caution | Penalizes prediction error | Uncertainty → avoid |
Results
- Simple — Minimal additional computation
- Computationally efficient — Just an error model forward pass
- Provably better — Theoretical analysis shows improvement over standard BoN in linear settings
- Practical — Substantial mitigation of reward hacking across benchmarks
Why It Matters
- Scaling reliability — Makes inference-time compute scaling more trustworthy
- Better with more compute — Instead of degrading, larger N actually helps
- Foundation for agents — BoN is used in agentic frameworks; reliable scoring is critical
- Generalizable — The curiosity/caution framework may extend to other OOD detection tasks
Broader Significance
The paper also provides evidence that curiosity-based approaches can serve as a general out-of-distribution detection technique in LLM settings — a finding with implications beyond reward hacking.