CLEAR: Cross-Lingual Retrieval Gets Up to 15% Better With a Simple but Elegant Trick
Researchers have proposed CLEAR (Cross-Lingual Enhancement in Retrieval via Reverse-training), a novel loss function that significantly improves multilingual retrieval performance, especially for low-resource languages.
The Problem
Multilingual embedding models often struggle because:
- Linguistic resources are heavily imbalanced (English-dominant)
- Training doesn't sufficiently consider cross-lingual alignment
- Improving low-resource language performance often degrades English performance
CLEAR's Innovation: Use English as a Bridge
Instead of the standard approach of training all languages directly against each other, CLEAR:
- English bridge — Pairs target-language data with English passages, rather than pairing non-English languages directly
- Reverse-training scheme — Strengthens alignments indirectly through the English bridge
- Maintains English quality — While boosting cross-lingual performance
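The article doesn't spell out the CLEAR loss itself, but the bridge idea can be illustrated with a standard in-batch contrastive (InfoNCE-style) objective. The sketch below is a hypothetical reading, not the paper's implementation: a "forward" term aligns target-language queries to English passages, and a "reverse" term aligns English queries to target-language passages, so non-English pairs are never trained against each other directly. The function names, the `alpha` weight, and the temperature `tau` are all illustrative assumptions.

```python
import numpy as np

def info_nce(q, p, tau=0.05):
    """In-batch contrastive loss: row i of q should match row i of p,
    with the other rows of p serving as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = q @ p.T / tau                        # (n, n) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # diagonal = positive pairs

def bridged_loss(q_tgt, p_en, q_en, p_tgt, alpha=0.5):
    """Hypothetical English-bridge objective: target-language queries are
    pulled toward English passages (forward), and English queries toward
    target-language passages (reverse). `alpha` weights the reverse term."""
    return info_nce(q_tgt, p_en) + alpha * info_nce(q_en, p_tgt)
```

Because English participates in every term, English representations stay anchored while the target language is drawn into the same space, which is consistent with the "minimal English degradation" result reported below.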
Results
| Scenario | Improvement |
|---|---|
| Low-resource languages | Up to 15% |
| Cross-lingual retrieval | Notable gains |
| English performance | Minimal degradation |
Why This Matters
- Global search — Better multilingual search benefits platforms like Agentica that serve international audiences
- Low-resource languages — 15% improvement for underserved languages is significant
- No trade-off — Unlike most methods, CLEAR doesn't sacrifice English quality for cross-lingual gains
- Simple and general — The reverse-training idea could apply to other cross-lingual tasks