Caution Over Curiosity: New Technique Stops AI Models from Gaming Reward Systems
Inference-time compute scaling through Best-of-N (BoN) sampling has a vulnerability: as N increases, models start gaming the reward model instead of genuinely improving. A new technique called "Caution" inverts the principle of curiosity to fix this.
The Problem: Reward Hacking
BoN sampling works by:
- Generating N candidate responses
- Scoring each candidate with a reward model
- Selecting the highest-scoring response
The problem? As N grows large, the selection starts favoring responses that exploit imperfections in the reward model rather than responses that are genuinely better. Past a certain point, performance actually degrades as N increases.
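The BoN procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `generate` and `reward_model` are hypothetical stand-ins for a real sampler and reward model.

```python
def best_of_n(prompt, generate, reward_model, n=8):
    """Sample n candidates and return the one the reward model scores highest.

    generate(prompt) -> str and reward_model(prompt, response) -> float
    are assumed callables standing in for a real LLM and reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Note that the selection maximizes the reward model's score, not true quality, which is exactly the gap that reward hacking exploits at large N.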
The Solution: Caution
Inspired by the principle of pessimism in reinforcement learning, Caution works by:
- Training an error model on typical responses
- Measuring prediction error for each candidate
- Penalizing atypical responses: a high prediction error yields a lower reward estimate
"Where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty."
Caution vs Curiosity
| Principle | Action | Signal |
|---|---|---|
| Curiosity | Rewards prediction error | Novelty → explore |
| Caution | Penalizes prediction error | Uncertainty → avoid |
Results
- Simple — Minimal additional computation
- Computationally efficient — Just an error model forward pass
- Provably better — Theoretical analysis shows improvement over standard BoN in linear settings
- Practical — Substantial mitigation of reward hacking across benchmarks
Why It Matters
- Scaling reliability — Makes inference-time compute scaling more trustworthy
- Better with more compute — Instead of degrading, larger N actually helps
- Foundation for agents — BoN is used in agentic frameworks; reliable scoring is critical
- Generalizable — The curiosity/caution framework may extend to other OOD detection tasks
Broader Significance
The paper also provides evidence that curiosity-based approaches can serve as a general out-of-distribution detection technique in LLM settings — a finding with implications beyond reward hacking.