HukukBERT: First Comprehensive Turkish Legal Language Model Achieves 84.4% on Legal Cloze Test

2026-04-07

Researchers have introduced HukukBERT, which they present as the most comprehensive language model for Turkish law to date. It was trained on 18 GB of cleaned legal text using domain-adaptive pre-training.

The Gap in Legal AI

While English legal NLP has flourished with models like Legal-BERT, Turkish legal NLP has lagged because of:

Scarcity of domain-specific data
Limited Turkish NLP resources
Lack of high-volume legal corpora

HukukBERT's Approach

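The model uses a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that combines four masking strategies:

Whole-Word Masking: Masks every subword piece of a word together, so the model cannot recover the word from its own fragments
Token Span Masking: Masks contiguous runs of subword tokens
Word Span Masking: Masks contiguous runs of whole words
Keyword Masking: Preferentially masks legal terminology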
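
As a rough illustration of how whole-word and keyword masking can work on WordPiece tokens, here is a minimal Python sketch. It is not the authors' implementation: the function names, the 15% masking rate, the rule that keywords are always masked, and the way the sample sentence is split into pieces are all assumptions made for the example. It also skips the usual BERT detail of sometimes substituting a random or unchanged token instead of [MASK].

```python
import random

MASK = "[MASK]"

def group_wordpieces(tokens):
    """Group token indices into words; a "##" prefix marks a continuation piece."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words

def surface_form(tokens, idxs):
    """Rebuild a word's text from its pieces by stripping "##" prefixes."""
    return "".join(
        tokens[i][2:] if tokens[i].startswith("##") else tokens[i] for i in idxs
    )

def mask_whole_words(tokens, mask_prob=0.15, keywords=frozenset(), seed=None):
    """Mask whole words at random; always mask words listed in `keywords` (assumed rule)."""
    rng = random.Random(seed)
    masked = list(tokens)
    for idxs in group_wordpieces(tokens):
        is_keyword = surface_form(tokens, idxs) in keywords
        if is_keyword or rng.random() < mask_prob:
            # Every piece of the selected word is masked together.
            for i in idxs:
                masked[i] = MASK
    return masked

# Hypothetical WordPiece split of a short Turkish sentence.
tokens = ["dava", "##cı", "temyiz", "baş", "##vur", "##usu", "yaptı"]
print(mask_whole_words(tokens, mask_prob=0.0, keywords={"temyiz"}))
```

Span masking would follow the same pattern, except that the unit selected for masking is a run of adjacent tokens or adjacent words rather than a single word.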

Training data: 18 GB of cleaned Turkish legal text

Tokenizer: 48K WordPiece vocabulary
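
For readers unfamiliar with WordPiece, the sketch below shows the standard greedy longest-match-first way a WordPiece vocabulary is applied to a single word. The tiny vocabulary and the function name are hypothetical and stand in for the real 48K vocabulary, which is not reproduced here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration only.
toy_vocab = {"mahkeme", "##si", "##nin", "karar", "##ı"}
print(wordpiece_tokenize("mahkemesinin", toy_vocab))
print(wordpiece_tokenize("kararı", toy_vocab))
```

A larger, domain-specific vocabulary means that frequent legal terms are more likely to survive as a single token or a few pieces rather than being split into many fragments.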

Results

Benchmark	Metric	Result
Legal Cloze Test	Top-1 accuracy	84.40% (new state of the art)
Court Decision Segmentation	Document pass rate	92.8% (new state of the art)

What Is a Legal Cloze Test?

The Legal Cloze Test is a masked-term prediction task built on Turkish court decisions: a legal term is blanked out of a passage and the model must predict which term belongs there. Top-1 accuracy is the share of blanks for which the model's first guess is the correct term. Think of it as a fill-in-the-blank exam for language models.
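
To make the metric concrete, here is a small Python sketch of how top-k accuracy can be computed for a cloze task. The function name and the four toy examples are invented for illustration; they are not drawn from the benchmark or from the reported results.

```python
def top_k_accuracy(ranked_predictions, gold_terms, k=1):
    """Fraction of blanks whose gold term appears among the model's top-k guesses."""
    if len(ranked_predictions) != len(gold_terms):
        raise ValueError("need one ranked candidate list per gold term")
    if not gold_terms:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_terms)
        if gold in ranked[:k]
    )
    return hits / len(gold_terms)

# Toy data: one gold term per blank, plus the model's guesses ranked best-first.
gold = ["temyiz", "davacı", "beraat", "tazminat"]
ranked = [
    ["temyiz", "istinaf"],
    ["davalı", "davacı"],   # correct term is only the second guess
    ["beraat", "mahkumiyet"],
    ["tazminat", "nafaka"],
]
top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
print(top1, top2)
```

Top-1 is the strictest setting: a near miss such as the second example counts as wrong even though the correct term was the runner-up.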

Why This Matters

Legal accessibility: Could make Turkish law easier to search and understand with AI tools
Judicial efficiency: Could automate parts of court document analysis, such as segmenting decisions
Non-English NLP: Demonstrates that legal language models can work in languages beyond English
Open source: Model released to support future Turkish legal NLP research

Future Applications

The researchers envision HukukBERT enabling:

Named entity recognition in legal documents
Judgment prediction
Legal document classification
Contract analysis

Original source · 2026-04-07