Making LLMs Distributionally Fair: KL-Optimized Fine-Tuning Controls Output Bias Across Gender, Race, and Sentiment
Researchers demonstrate that off-the-shelf LLMs, prompt engineering, and standard alignment techniques such as DPO fail to reliably control output distributions across sensitive attributes, and propose a fine-tuning framework that uses KL divergence to fix this.
The Problem
The real world is stochastic — outcomes follow probability distributions. But LLMs are typically evaluated on single-round inference against fixed ground truths. When you sample an LLM many times on the same topic, do the outputs, taken together, reflect realistic distributions?
Testing across gender, race, and sentiment in occupational contexts revealed:
- Off-the-shelf LLMs produce biased distributions
- Prompt engineering doesn't reliably fix it
- DPO (Direct Preference Optimization) doesn't reliably fix it
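This kind of distributional bias can be made concrete: sample the model many times, tally the attribute labels, and compare the empirical distribution to a target with KL divergence. A minimal sketch — the counts, target figures, and function names below are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Normalize counts of sampled attribute labels into a distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over p's categories; eps guards against log(0)."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items())

# Illustrative run: 100 generations labeled by gender for a "nurse" prompt,
# compared against a hypothetical real-world target of 88% / 12%.
samples = ["female"] * 87 + ["male"] * 13
target = {"female": 0.88, "male": 0.12}
observed = empirical_distribution(samples)
bias = kl_divergence(observed, target)  # near 0 when distributions match
```

The same metric works for race or sentiment labels; only the label extraction step changes.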
The Solution: Steering Token Calibration + Semantic Alignment
The framework couples two innovations:
- Steering Token Calibration — KL divergence anchors the probability mass of latent steering tokens
- Semantic Alignment — Kahneman-Tversky Optimization (KTO) binds the steering tokens to semantically consistent responses
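The article doesn't give the exact loss, but the calibration idea can be sketched as a KL penalty between a target distribution and the model's probability mass over the latent steering tokens, renormalized over just that slice of the vocabulary. Everything below — the token setup, logit values, and function names — is an illustrative assumption:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def steering_kl_loss(steering_logits, target_probs, eps=1e-12):
    """KL(target || model) computed over the steering-token slice only.

    steering_logits: next-token logits restricted to the latent steering
    tokens; target_probs: the desired distribution over those tokens.
    """
    model_probs = softmax(steering_logits)
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target_probs, model_probs))

# Toy call: two steering tokens (e.g., <FEMALE>, <MALE>), 50/50 target.
loss = steering_kl_loss([0.0, 0.0], [0.5, 0.5])  # equal logits -> ~0 loss
```

Because the penalty anchors only the steering tokens, the rest of the vocabulary is free to carry the response content — that is the division of labor the two-part framework exploits.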
Why Kahneman-Tversky?
The framework incorporates insights from behavioral economics (Kahneman & Tversky's prospect theory) to model how humans actually process probabilistic information, rather than how an idealized rational agent would.
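Concretely, Kahneman-Tversky Optimization (KTO, Ethayarajh et al.) replaces pairwise preference losses with a prospect-theory value function: gains and losses are measured relative to a reference point and saturate through a sigmoid. A simplified per-example sketch, with both loss-aversion weights set to 1 and illustrative function names:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_value(log_ratio, z_ref, desirable, beta=0.1):
    """Prospect-theory value of one example in the KTO objective.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x)
    z_ref:     reference point (in KTO, an estimate of the
               policy-to-reference KL divergence)
    Values saturate through a sigmoid around z_ref, mirroring the
    diminishing sensitivity of Kahneman-Tversky's value function.
    """
    if desirable:
        return sigmoid(beta * (log_ratio - z_ref))
    return sigmoid(beta * (z_ref - log_ratio))

def kto_loss(log_ratio, z_ref, desirable, beta=0.1):
    """Per-example loss (with lambda_D = lambda_U = 1): minimized by
    upweighting desirable outputs and downweighting undesirable ones."""
    return 1.0 - kto_value(log_ratio, z_ref, desirable, beta)
```

Unlike DPO, this needs only a binary desirable/undesirable label per example rather than paired preferences, which is what lets it bind individual steering tokens to semantically consistent responses.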
Results
Evaluated on six diverse datasets, the framework successfully steers output distributions to match the specified targets.
Why This Matters
- Fair AI hiring — If AI generates job candidate descriptions, they should reflect real demographics
- Beyond single outputs — distributional fairness (the statistics over many generations) matters more than the fairness of any individual response
- New alignment paradigm — Current alignment focuses on "don't say bad things" not "generate representative distributions"