Policy Gradient Derivation Demystified: The Missing 'Causality' Step in Reinforcement Learning Education
A new paper isolates and makes rigorous the often hand-waved "causality" step in policy gradient derivations: the point where the full trajectory return is replaced by the reward-to-go. The result is a mathematically explicit treatment of a step that most RL courses state without proof.
The Problem
Every reinforcement learning textbook teaches policy gradients like this:
- Derive the REINFORCE estimator using the full trajectory return
- State that, by "causality," the full return can be replaced by the reward-to-go
- Move on
But how exactly does causality justify this replacement? The step is typically glossed over, leaving students unsure why rewards earned *before* time t can be dropped from the gradient term at time t.
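Concretely, the replacement in question is the following (standard notation assumed: trajectory τ, horizon T, policy π_θ, per-step reward r_t):

```latex
% Full-return REINFORCE estimator
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t'=0}^{T-1} r_{t'}
    \right]
% The "causality" step swaps the full return for the reward-to-go,
% so the score at time t is weighted only by rewards from t onward:
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t'=t}^{T-1} r_{t'}
    \right]
```

The question is why the second line equals the first, i.e. why the terms with t' < t contribute nothing in expectation.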
The Fix
The paper shows that:
- Reward-to-go arises directly from decomposing the objective over prefix trajectories
- The "causality" argument is a corollary of the derivation, not an additional heuristic
- Using prefix trajectory distributions and the score-function identity, the replacement is mathematically explicit
- No separate "causality" step is needed — it's baked into the mathematics
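The core of the argument can be sketched in one identity (a standard calculation, not the paper's full prefix-distribution derivation): for t' < t, condition on the trajectory prefix, which fixes both s_t and r_{t'}, and apply the score-function identity.

```latex
\mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'} \right]
  = \mathbb{E}\!\left[ r_{t'} \,
      \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[
        \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \right]
  = 0,
% since the inner expectation vanishes:
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[
    \nabla_\theta \log \pi_\theta(a \mid s) \right]
  = \sum_a \nabla_\theta \pi_\theta(a \mid s)
  = \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = \nabla_\theta 1 = 0.
```

Every past-reward term therefore drops out exactly, which is what the informal "causality" appeal was gesturing at.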
Technical Insight
| Traditional Presentation | This Paper's Approach |
|---|---|
| Derive with full return | Derive over prefix distributions |
| Hand-wave "causality" | Causality emerges naturally |
| Reward-to-go as post-hoc fix | Reward-to-go is intrinsic |
| Two-step (derive + fix) | One-step (correct from start) |
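The equality of the two estimators can be verified exactly on a toy problem by enumerating trajectories rather than sampling. The sketch below (all names and the problem setup are illustrative, not from the paper) uses a two-step episode with two actions, reward r_t = a_t, and a one-parameter sigmoid policy:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def grad_estimates(theta: float):
    """Exact expectations of both policy-gradient estimators on a toy
    2-step problem: actions in {0, 1}, reward r_t = a_t, and
    pi(a=1) = sigmoid(theta). Enumerates all 4 trajectories."""
    p1 = sigmoid(theta)
    pi = {0: 1.0 - p1, 1: p1}
    # d/dtheta log pi(a): (1 - p1) for a = 1, (-p1) for a = 0
    dlog = {0: -p1, 1: 1.0 - p1}
    full_return_grad = 0.0
    reward_to_go_grad = 0.0
    for a0 in (0, 1):
        for a1 in (0, 1):
            prob = pi[a0] * pi[a1]
            rewards = [float(a0), float(a1)]
            # Full-return REINFORCE: every score weighted by R(tau)
            full_return_grad += prob * (dlog[a0] + dlog[a1]) * sum(rewards)
            # Reward-to-go: score at t weighted by rewards from t onward
            reward_to_go_grad += prob * (dlog[a0] * (rewards[0] + rewards[1])
                                         + dlog[a1] * rewards[1])
    return full_return_grad, reward_to_go_grad
```

The two exact expectations coincide, as the derivation predicts: the dropped cross-terms factor into E[∇log π(a_1)] = 0 times a past reward.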
Why It Matters
- Education — Every RL student encounters this confusion
- Understanding — Proper derivation leads to deeper intuition
- Implementation — Clearer math prevents bugs in reward shaping
- Reproducibility — Standardized derivations improve code reliability
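On the implementation point: the quantity the corrected derivation yields is the per-timestep reward-to-go, which is computed with a single backward pass. A minimal sketch (function name and discounting parameter are illustrative):

```python
from typing import List

def rewards_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """For each timestep t, return the (discounted) sum of rewards
    from t onward. The policy-gradient term for action a_t is weighted
    by this value, not by the full trajectory return."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards: rtg[t] = r[t] + gamma * rtg[t+1]
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

For rewards [1, 2, 3] with gamma = 1, this gives [6, 5, 3]: the first action is credited with everything that follows it, the last only with its own reward.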
Subject: cs.AI
The paper is classified under Artificial Intelligence on arXiv, indicating its primary audience is the machine learning community.
Broader Significance
This is a "small but important" contribution — not a breakthrough algorithm, but a correction to how we teach and understand one of the most fundamental concepts in reinforcement learning. Good pedagogy matters: it shapes how the next generation of researchers thinks about problems.