HukukBERT: First Comprehensive Turkish Legal Language Model Achieves 84.4% on Legal Cloze Test
Researchers have introduced HukukBERT, which they present as the most comprehensive language model for Turkish law to date. It was trained on 18 GB of cleaned legal text using a hybrid domain-adaptive pre-training (DAPT) scheme.
The Gap in Legal AI
While English legal AI has flourished with models like Legal-BERT, Turkish legal NLP has lagged due to:
- Scarcity of domain-specific data
- Limited Turkish NLP resources
- Lack of high-volume legal corpora
HukukBERT's Approach
The model uses a hybrid Domain-Adaptive Pre-Training (DAPT) methodology:
- Whole-Word Masking — Masks entire words during training
- Token Span Masking — Masks token sequences
- Word Span Masking — Masks word sequences
- Keyword Masking — Targets specific legal terminology
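The strategies above differ mainly in what unit gets hidden. The following is a minimal sketch of three of them over WordPiece tokens, not HukukBERT's actual training code: the function names, the 15% masking rate, and the example token split are all assumptions made for illustration. Word span masking would apply the same idea as token span masking to runs of whole words.

```python
import random

MASK = "[MASK]"


def group_wordpieces(tokens):
    """Group token indices into words; a '##' prefix continues the previous word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Mask all pieces of each randomly selected word, never only part of one."""
    if rng is None:
        rng = random.Random()
    out = list(tokens)
    for word in group_wordpieces(tokens):
        if rng.random() < mask_prob:
            for i in word:
                out[i] = MASK
    return out


def keyword_mask(tokens, keywords):
    """Mask every word whose reassembled surface form appears in `keywords`."""
    out = list(tokens)
    for word in group_wordpieces(tokens):
        pieces = [tokens[i][2:] if tokens[i].startswith("##") else tokens[i] for i in word]
        if "".join(pieces) in keywords:
            for i in word:
                out[i] = MASK
    return out


def token_span_mask(tokens, start, length):
    """Mask a contiguous run of tokens, ignoring word boundaries."""
    out = list(tokens)
    for i in range(start, min(start + length, len(tokens))):
        out[i] = MASK
    return out


# Hypothetical WordPiece split of a short Turkish phrase (not real tokenizer output).
tokens = ["mahkeme", "dava", "##nın", "red", "##dine", "karar", "verdi"]

print(keyword_mask(tokens, {"davanın"}))   # hides both pieces of "davanın"
print(token_span_mask(tokens, 2, 2))       # hides "##nın" and "red", cutting across two words
print(whole_word_mask(tokens, rng=random.Random(0)))
```

The token span example shows why the distinction matters: a span can start in the middle of one word and end in the middle of the next, while whole-word and keyword masking always hide complete words.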
Training data: 18 GB of cleaned Turkish legal text
Tokenizer: WordPiece with a 48K-token vocabulary
Results
| Benchmark | Performance |
|---|---|
| Legal Cloze Test (Top-1 accuracy) | 84.40% (new state of the art) |
| Court Decision Segmentation (document pass rate) | 92.8% (new state of the art) |
What Is a Legal Cloze Test?
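The Legal Cloze Test is a masked legal term prediction task designed for Turkish court decisions: a legal term is blanked out of a court document, and the model must predict which term belongs there. Top-1 accuracy is the share of blanks where the model's single highest-ranked guess matches the original term. Think of it as a fill-in-the-blank bar exam for language models.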
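To make the metric concrete, here is a minimal sketch of how Top-1 accuracy could be computed from ranked predictions. The function name and the four example items are invented for illustration and are not taken from the benchmark itself.

```python
def top1_accuracy(ranked_predictions, gold_terms):
    """Fraction of blanks where the highest-ranked candidate equals the gold term."""
    if len(ranked_predictions) != len(gold_terms):
        raise ValueError("need exactly one ranked candidate list per gold term")
    if not gold_terms:
        raise ValueError("cannot score an empty test set")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_terms)
        if ranked and ranked[0] == gold
    )
    return hits / len(gold_terms)


# Made-up model output: for each blank, candidate terms ordered best-first.
ranked_predictions = [
    ["davacı", "davalı"],
    ["temyiz", "istinaf"],
    ["beraat", "mahkumiyet"],
    ["tazminat", "nafaka"],
]
gold_terms = ["davacı", "istinaf", "beraat", "tazminat"]

accuracy = top1_accuracy(ranked_predictions, gold_terms)
print(f"{accuracy:.2%}")  # 75.00%
```

In the second item the correct term is ranked second, so it does not count: only the first guess matters for Top-1.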
Why This Matters
- Legal accessibility — Makes Turkish law more accessible through AI
- Judicial efficiency — Automates court document analysis
- Non-English NLP — Demonstrates that legal AI can work in languages beyond English
- Open source — Model released to support future Turkish legal NLP research
Future Applications
The researchers envision HukukBERT enabling:
- Named entity recognition in legal documents
- Judgment prediction
- Legal document classification
- Contract analysis