KL-Optimized Fine-Tuning Controls LLM Output Distribution Bias Across Gender, Race, and Sentiment

2026-04-08 · 1 min read

Making LLMs Distributionally Fair: KL-Optimized Fine-Tuning Controls Output Bias

Researchers have demonstrated that off-the-shelf LLMs and standard alignment techniques fail to reliably control output distributions across sensitive attributes, and proposed a novel fine-tuning framework using KL divergence to fix this.

The Problem

The real world is stochastic: outcomes follow probability distributions. But LLMs are typically evaluated on single-round inference against fixed ground truths. When you ask an LLM the same question many times, do its outputs reflect realistic distributions?

Testing across gender, race, and sentiment in occupational contexts revealed that neither off-the-shelf LLMs nor standard alignment techniques reliably produce outputs matching realistic target distributions.
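One way to make this concrete is to sample the model repeatedly, tally the attribute it produces, and measure the gap to a target distribution with KL divergence. A minimal sketch (the sample counts and target figures below are illustrative, not from the paper):

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the categories present in p."""
    return sum(p[k] * math.log((p[k] + eps) / (q[k] + eps)) for k in p)

def empirical_distribution(samples):
    """Normalize raw sample counts into a probability distribution."""
    counts = Counter(samples)
    total = len(samples)
    return {k: v / total for k, v in counts.items()}

# Hypothetical attribute labels extracted from 100 repeated generations
# for a prompt like "describe a nurse" (illustrative numbers):
samples = ["female"] * 92 + ["male"] * 8
target = {"female": 0.87, "male": 0.13}  # illustrative target, not real data

p_hat = empirical_distribution(samples)
print(f"empirical: {p_hat}  KL to target: {kl_divergence(p_hat, target):.4f}")
```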

The Solution: Steering Token Calibration + Semantic Alignment

The framework couples two innovations:

  1. Steering Token Calibration: a KL-divergence objective anchors the probability mass of latent steering tokens to the target distribution
  2. Semantic Alignment: Kahneman-Tversky Optimization (KTO) binds each steering token to semantically consistent responses
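The paper's exact objectives aren't reproduced here, but the calibration idea can be sketched as minimizing the KL divergence between the distribution induced by steering-token logits and a target attribute distribution. A pure-Python gradient-descent sketch under that assumption (the logits, target, and learning rate are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def calibrate(logits, target, lr=0.5, steps=400):
    """Gradient descent on steering-token logits z to minimize
    KL(softmax(z) || target). Analytic gradient:
    d/dz_i = p_i * (log(p_i / t_i) - KL)."""
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        kl = sum(pi * math.log(pi / ti) for pi, ti in zip(p, target))
        grad = [pi * (math.log(pi / ti) - kl) for pi, ti in zip(p, target)]
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z

# Example: steer a 3-way attribute distribution toward a target.
target = [0.5, 0.3, 0.2]              # illustrative target, not from the paper
z = calibrate([2.0, 0.0, 0.0], target)
print([round(p, 3) for p in softmax(z)])
```

In the actual framework this anchoring happens during fine-tuning over the model's latent steering tokens; the sketch only shows the KL-anchoring mechanism in isolation.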

Why Kahneman-Tversky?

The framework incorporates insights from behavioral economics (Kahneman and Tversky's prospect theory) to handle how humans actually process probabilistic information, not how an idealized rational agent would.
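For intuition, prospect theory's value function is concave for gains, convex for losses, and steeper on the loss side (loss aversion). A sketch using Tversky and Kahneman's classic 1992 parameter estimates; note that KTO's actual loss adapts this asymmetry rather than using this function verbatim:

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Kahneman-Tversky value function: concave for gains,
    convex and steeper (loss aversion, lam > 1) for losses.
    Default parameters are the classic 1992 estimates."""
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)

# Loss aversion: a loss of 10 is felt more strongly than a gain of 10.
print(prospect_value(10.0), prospect_value(-10.0))
```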

Results

Across six diverse datasets, the framework successfully steers output distributions to match specified targets.

Why This Matters

  1. Fair AI hiring: if AI generates job-candidate descriptions, they should reflect real demographics
  2. Beyond single outputs: distributional fairness matters more than the fairness of any individual response
  3. A new alignment paradigm: current alignment focuses on "don't say bad things," not "generate representative distributions"
↗ Original source · 2026-04-08