Data Attribution in Adaptive Learning: Why Standard Methods Fail When AI Generates Its Own Training Data
As ML models increasingly generate their own training data — through online bandits, reinforcement learning, and post-training pipelines for language models — standard data attribution methods become fundamentally unreliable. New research formalizes why and proposes a fix.
The Problem
Standard data attribution methods (influence functions, TracIn, and the like) assume a static training dataset. But in adaptive learning settings:
- A single training observation both updates the learner AND shifts the distribution of future data
- This feedback loop invalidates static attribution assumptions
- You can't tell whether a performance improvement came from a specific data point itself or from the distribution shift that point caused
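The bullet points above can be made concrete in a few lines. Below is a minimal sketch with made-up numbers and a hypothetical `run_greedy_bandit` helper (not from the paper): a deterministic greedy bandit in which deleting one early observation changes every later sampling decision.

```python
def run_greedy_bandit(drop_first_update=False, steps=6):
    """Deterministic two-armed greedy bandit with optimistic initial
    value estimates. Each observation updates the estimates, and the
    estimates decide which arm is sampled next -- so one observation
    also shifts the distribution of all future data."""
    arm_reward = [0.4, 0.6]   # deterministic payoffs (toy numbers)
    values = [1.0, 1.0]       # optimistic initialisation
    counts = [0, 0]
    actions = []
    for t in range(steps):
        arm = 0 if values[0] >= values[1] else 1  # greedy choice
        actions.append(arm)
        if t == 0 and drop_first_update:
            continue          # counterfactually delete one observation
        counts[arm] += 1
        values[arm] += (arm_reward[arm] - values[arm]) / counts[arm]
    return actions

print(run_greedy_bandit())                        # [0, 1, 1, 1, 1, 1]
print(run_greedy_bandit(drop_first_update=True))  # [0, 0, 1, 1, 1, 1]
```

One deleted update delays learning by a single step, and the entire downstream action sequence shifts with it.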
Formal Result
The paper proves that "replay-side information cannot recover occurrence-level attribution in general" — meaning you can't simply re-analyze logged data to figure out what mattered. This is a fundamental impossibility result, not just a practical limitation.
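As a toy illustration of what that impossibility looks like in practice (my own construction with a hypothetical `run_adaptive` helper, not the paper's proof): re-analysing a frozen log estimates a zero effect for an observation whose deletion, under a genuine rerun of the adaptive process, visibly changes behaviour.

```python
def run_adaptive(drop_first_update=False, steps=6):
    """Deterministic greedy two-armed bandit (toy numbers). Returns
    the log of (arm, reward) pairs; with drop_first_update, the first
    observation never reaches the learner."""
    arm_reward = [0.4, 0.6]
    values, counts = [1.0, 1.0], [0, 0]
    log = []
    for t in range(steps):
        arm = 0 if values[0] >= values[1] else 1  # greedy under current estimates
        log.append((arm, arm_reward[arm]))
        if t == 0 and drop_first_update:
            continue  # observation deleted from training
        counts[arm] += 1
        values[arm] += (arm_reward[arm] - values[arm]) / counts[arm]
    return log

def pulls_of_best(log):
    return sum(1 for arm, _ in log if arm == 1)  # arm 1 is the better arm

factual_log = run_adaptive()

# Replay-side estimate: re-analyse the fixed log with the point removed.
# Every recorded action stays as recorded, so the point looks inert.
replay_effect = pulls_of_best(factual_log) - pulls_of_best(factual_log[1:])

# True counterfactual: rerun the adaptive process without the point.
# Its deletion delays learning one step and changes what gets pulled.
true_effect = pulls_of_best(factual_log) - pulls_of_best(run_adaptive(drop_first_update=True))

print(replay_effect, true_effect)  # 0 1
```

The log alone says the observation did nothing; only rerunning the loop reveals its downstream effect, which is exactly the information the impossibility result says logs cannot supply in general.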
When It Works
The researchers identify a specific structural class of adaptive learning problems in which the attribution target IS identifiable from logged data, giving a principled condition for when standard attribution can still be applied.
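The paper's exact structural condition is its own; as a hedged sketch of the flavour (my construction, toy numbers, hypothetical `collect` helper): if the collection policy ignores the learner's state, such as a fixed round-robin schedule, deleting an observation cannot propagate into future data, so the log fully determines the counterfactual.

```python
def collect(policy, drop_step=None, steps=6):
    """Run a two-armed bandit under `policy`, a function of the current
    value estimates and the step index. Returns the arms pulled;
    `drop_step` counterfactually deletes one observation from training."""
    arm_reward = [0.4, 0.6]
    values, counts = [1.0, 1.0], [0, 0]
    arms = []
    for t in range(steps):
        arm = policy(values, t)
        arms.append(arm)
        if t == drop_step:
            continue  # the learner never sees this observation
        counts[arm] += 1
        values[arm] += (arm_reward[arm] - values[arm]) / counts[arm]
    return arms

greedy = lambda values, t: 0 if values[0] >= values[1] else 1
round_robin = lambda values, t: t % 2   # ignores the learner's state

# Adaptive collection: deleting an observation changes the future
# data stream, so the log under-determines the counterfactual.
print(collect(greedy) == collect(greedy, drop_step=0))            # False

# Non-adaptive collection: the data stream is exogenous, deletion
# cannot propagate, and replaying the log equals rerunning the process.
print(collect(round_robin) == collect(round_robin, drop_step=0))  # True
```

The contrast is the condition in miniature: when future data does not depend on past observations, static replay-style attribution and the true counterfactual coincide.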
Why This Matters
This is increasingly relevant as AI training moves from static datasets to dynamic, self-generated data:
- RLHF — Models fine-tuned on human feedback over their own outputs
- Constitutional AI — Iterative self-improvement
- Online learning — Models that adapt continuously from user interactions
- Self-play — Game-playing AIs that generate training games
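A stripped-down caricature of such loops (toy dynamics and a hypothetical `self_train` helper, not any production pipeline): a "model" repeatedly trains on its own greedy samples, so probability mass snowballs onto whatever it already prefers, and each training example is entangled with the distribution of every later one.

```python
def self_train(probs, steps=5, lr=0.5):
    """Toy self-training loop: at each step the model emits its modal
    token, then trains on that emission, shifting probability mass
    toward it (a rich-get-richer feedback loop)."""
    probs = list(probs)
    generated = []
    for _ in range(steps):
        tok = probs.index(max(probs))       # greedy self-generated sample
        generated.append(tok)
        probs[tok] += lr                    # "train" on the sample
        total = sum(probs)
        probs = [p / total for p in probs]  # renormalise
    return generated, probs

generated, final = self_train([0.4, 0.35, 0.25])
print(generated)           # [0, 0, 0, 0, 0] -- the model feeds itself
print(round(final[0], 2))  # 0.92 -- mass collapses onto the favoured token
```

Attributing the final distribution to any one of those five "training examples" is exactly the tangled problem the paper formalizes: each sample both updated the model and determined which samples followed.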
Implications
- Existing attribution tools may give misleading results in adaptive settings
- New attribution frameworks are needed for modern AI training pipelines
- The structural conditions identified provide a path forward