Bidirectional Entropy Modulation: Rethinking Exploration in Reinforcement Learning for LLM Reasoning
Research on reinforcement learning with verifiable rewards (RLVR) has revealed a fundamental limitation: policies rapidly converge to narrow solution sets. A new paper proposes decomposing policy entropy into "informative" and "spurious" components, enabling more effective exploration without blind entropy maximization.
The Problem
RLVR has significantly advanced LLM reasoning (think o1-style models). However:
- Restricted exploration: the policy quickly converges to a narrow set of solutions
- Unreliable entropy regularization: standard entropy bonuses are highly sensitive to hyperparameters and yield only marginal gains (see the baseline sketch after this list)
- Blind maximization hurts: simply pushing total entropy up degrades reasoning quality
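For reference, the baseline being criticized is the standard policy-gradient objective with a single scalar entropy bonus. A minimal PyTorch sketch (the function name and the default `beta` are illustrative, not from the paper):

```python
import torch

def entropy_bonus_loss(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with a global entropy bonus.

    `beta` is the hyperparameter whose sensitivity makes this
    baseline unreliable in practice.
    """
    log_probs = torch.log_softmax(logits, dim=-1)             # (batch, vocab)
    probs = log_probs.exp()
    action_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    pg_loss = -(advantages * action_logp).mean()              # REINFORCE term
    entropy = -(probs * log_probs).sum(-1).mean()             # mean token entropy

    # A single scalar bonus pushes *all* entropy up, whether or not
    # that uncertainty corresponds to useful exploration.
    return pg_loss - beta * entropy
```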
The Key Insight
The paper decomposes policy entropy into two types:
- Informative entropy: uncertainty that preserves diverse solution paths (good)
- Spurious entropy: uncertainty that erodes reasoning patterns (bad)
Effective exploration means maximizing the informative component while suppressing the spurious one, not blindly maximizing total entropy.
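The decomposition is additive: each token's entropy falls into one bucket or the other, so the two components sum to the total. A minimal sketch, assuming per-token entropies and a boolean mask over "informative" tokens are already available (the summary does not say how the paper derives this mask, so it is simply an input here):

```python
import torch

def decompose_entropy(token_entropy, informative_mask):
    """Split total policy entropy into informative and spurious parts.

    token_entropy:    (seq,) per-token entropy of the policy
    informative_mask: (seq,) bool, True where uncertainty reflects a
                      genuine branching point between solution paths
                      (deriving this mask is the method's crux; it is
                      taken as given in this sketch)
    """
    h_informative = token_entropy[informative_mask].sum()
    h_spurious = token_entropy[~informative_mask].sum()
    # The components sum to the total, which is why maximizing total
    # entropy cannot separate useful diversity from eroded reasoning.
    return h_informative, h_spurious
```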
How It Works
- Group-relative advantage estimation: rewards are normalized within a group of rollouts for the same prompt, giving a parametric formulation that distinguishes useful from harmful exploration
- Entropy dynamics analysis: tracking how policy entropy changes over the course of training
- Bidirectional modulation: simultaneously encouraging diverse solutions while preserving reasoning structure (see the sketch after this list)
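A sketch of how these pieces might fit together: GRPO-style group-relative advantages combined with an entropy term whose sign flips between the two components. The paper's exact modulation rule is not given in this summary, so the code below only captures the sign structure it describes, and all names and the `alpha` coefficient are illustrative:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts sampled for the
    same prompt (GRPO-style): advantage = (r - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def bidirectional_entropy_loss(action_logp, token_entropy,
                               advantages, informative_mask, alpha=0.01):
    """Policy-gradient loss with opposite-signed entropy terms.

    action_logp:      (group, seq) log-probs of sampled tokens
    token_entropy:    (group, seq) per-token policy entropy
    advantages:       (group,) group-relative advantages
    informative_mask: (group, seq) bool, True at informative tokens
    """
    pg_loss = -(advantages.unsqueeze(-1) * action_logp).mean()

    # Bidirectional modulation: the -alpha term rewards informative
    # entropy, the +alpha term penalizes spurious entropy.
    h_info = (token_entropy * informative_mask).mean()
    h_spur = (token_entropy * ~informative_mask).mean()
    return pg_loss - alpha * h_info + alpha * h_spur
```

In a training loop, `rewards` would come from the verifier over a group of rollouts for one prompt, and the loss would be backpropagated through both `action_logp` and `token_entropy`.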
Results
The approach outperforms standard entropy regularization on reasoning benchmarks, particularly on problems that call for diverse solution strategies.