Making LLMs Distributionally Fair: KL-Optimized Fine-Tuning Controls Output Bias Across Gender, Race, and Sentiment
Researchers demonstrate that off-the-shelf LLMs, prompt engineering, and standard alignment techniques such as DPO fail to reliably control output distributions across sensitive attributes, and propose a fine-tuning framework that uses KL divergence to fix this.
The Problem
The real world is stochastic — outcomes follow probability distributions. But LLMs are typically evaluated on single-round inference against fixed ground truths. When you sample an LLM many times on the same topic, do the outputs, taken together, reflect realistic distributions?
Testing across gender, race, and sentiment in occupational contexts revealed:
- Off-the-shelf LLMs produce biased distributions
- Prompt engineering doesn't reliably fix it
- DPO (Direct Preference Optimization) doesn't reliably fix it
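This kind of distributional bias can be made concrete: sample the model many times, tally the attribute labels, and compare the empirical distribution to a target with KL divergence. A minimal sketch — the counts, target figures, and function names below are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Normalize counts of sampled attribute labels into a distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over p's categories; eps guards against log(0)."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items())

# Illustrative run: 100 generations labeled by gender for a "nurse" prompt,
# compared against a hypothetical real-world target of 88% / 12%.
samples = ["female"] * 87 + ["male"] * 13
target = {"female": 0.88, "male": 0.12}
observed = empirical_distribution(samples)
bias = kl_divergence(observed, target)  # near 0 when distributions match
```

The same metric works for race or sentiment labels; only the label extraction step changes.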
The Solution: Steering Token Calibration + Semantic Alignment
The framework couples two innovations:
- Steering Token Calibration — KL divergence anchors the probability mass of latent steering tokens
- Semantic Alignment — Kahneman-Tversky Optimization (KTO) binds the steering tokens to semantically consistent responses
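The article doesn't give the exact loss, but the calibration idea can be sketched as a KL penalty between a target distribution and the model's probability mass over the latent steering tokens, renormalized over just that slice of the vocabulary. Everything below — the token setup, logit values, and function names — is an illustrative assumption:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def steering_kl_loss(steering_logits, target_probs, eps=1e-12):
    """KL(target || model) computed over the steering-token slice only.

    steering_logits: next-token logits restricted to the latent steering
    tokens; target_probs: the desired distribution over those tokens.
    """
    model_probs = softmax(steering_logits)
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target_probs, model_probs))

# Toy call: two steering tokens (e.g., <FEMALE>, <MALE>), 50/50 target.
loss = steering_kl_loss([0.0, 0.0], [0.5, 0.5])  # equal logits -> ~0 loss
```

Because the penalty anchors only the steering tokens, the rest of the vocabulary is free to carry the response content — that is the division of labor the two-part framework exploits.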
Why Kahneman-Tversky?
The framework incorporates insights from behavioral economics (Kahneman & Tversky's prospect theory) to model how humans actually process probabilistic information, rather than how an idealized rational agent would.
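Concretely, Kahneman-Tversky Optimization (KTO, Ethayarajh et al.) replaces pairwise preference losses with a prospect-theory value function: gains and losses are measured relative to a reference point and saturate through a sigmoid. A simplified per-example sketch, with both loss-aversion weights set to 1 and illustrative function names:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_value(log_ratio, z_ref, desirable, beta=0.1):
    """Prospect-theory value of one example in the KTO objective.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x)
    z_ref:     reference point (in KTO, an estimate of the
               policy-to-reference KL divergence)
    Values saturate through a sigmoid around z_ref, mirroring the
    diminishing sensitivity of Kahneman-Tversky's value function.
    """
    if desirable:
        return sigmoid(beta * (log_ratio - z_ref))
    return sigmoid(beta * (z_ref - log_ratio))

def kto_loss(log_ratio, z_ref, desirable, beta=0.1):
    """Per-example loss (with lambda_D = lambda_U = 1): minimized by
    upweighting desirable outputs and downweighting undesirable ones."""
    return 1.0 - kto_value(log_ratio, z_ref, desirable, beta)
```

Unlike DPO, this needs only a binary desirable/undesirable label per example rather than paired preferences, which is what lets it bind individual steering tokens to semantically consistent responses.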
Results
Evaluated on six diverse datasets, the framework successfully steers output distributions to match the specified targets.
Why This Matters
- Fair AI hiring — If AI generates job candidate descriptions, they should reflect real demographics
- Beyond single outputs — distributional fairness (the statistics over many generations) matters more than the fairness of any individual response
- New alignment paradigm — Current alignment focuses on "don't say bad things" not "generate representative distributions"