Legal Domain Adaptation of Modern BERT Models
Quick Answer
The study shows that further pre-training ModernBERT on US court opinions with a masked language modeling objective yields significant improvements over vanilla ModernBERT on all evaluated datasets connected to US court opinions.
Quick Take
The adapted models process sequences of up to 8,192 tokens and can be used to embed legal passages or to rerank hundreds of passages for a given search query. In the authors' experiments, pre-training from scratch did not match further pre-training of an existing ModernBERT checkpoint. All model checkpoints are publicly released.
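One common way to rank passages with an embedding model is to score each passage by cosine similarity to the query embedding. The sketch below illustrates only that scoring step; it is not the authors' pipeline. The function names `cosine` and `rank_passages` are hypothetical, and the three-dimensional vectors are made-up stand-ins for real model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for embeddings (assumption: invented for illustration).
query = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
order = rank_passages(query, passages)
```

In practice the vectors would come from the released checkpoints, and a reranker could score query-passage pairs in other ways; cosine similarity over pooled embeddings is just one simple choice.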
Key Points
- ModernBERT further pre-trained on US court opinions shows significant gains over vanilla ModernBERT on all datasets connected to US court opinions.
- Models can process sequences of up to 8,192 tokens for legal text.
- Further pre-training an existing checkpoint outperforms pre-training from scratch in the authors' experiments.
- All model checkpoints are publicly released for further research.
- The gains are similar to those reported in early work on domain adaptation of BERT-like models.
Paper Resources
Abstract: We investigate domain adaptation of modern BERT models in the legal domain. We further pre-train ModernBERT on all US court opinions using the masked language modeling objective. Although ModernBERT has been trained on roughly 500x more data than original BERT, we still find that this model benefits from further pre-training and domain adaptation in the legal domain: we report significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions. We find gains similar to those reported in early work on domain adaptation of BERT-like models. However, from scratch pre-training does not match the performance of further pre-training an existing ModernBERT checkpoint in our experiments. The resulting models are capable of processing sequences up to 8,192 tokens, and can be used to compute meaningful embeddings of legal passages, or could quickly rerank hundreds of legal passages for a given search query. We release all model checkpoints publicly.
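The masked language modeling objective mentioned in the abstract hides a fraction of the input tokens and trains the model to predict them. The sketch below shows only the masking step on a whitespace-tokenized sentence. It is a simplification under stated assumptions: the function name `mask_tokens`, the `[MASK]` string, and the masking probability are illustrative choices, not the paper's settings, and real implementations typically work on subword IDs and may sometimes keep or randomly replace a selected token instead of always masking it.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption: illustrative only)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked-language-modeling example.

    labels[i] holds the original token where inputs[i] was masked and None
    elsewhere, so a loss would be computed only at masked positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what it should predict
        else:
            inputs.append(tok)
            labels.append(None)   # position is ignored by the loss
    return inputs, labels

tokens = "the court affirmed the judgment of the district court".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

Further pre-training then continues optimizing this prediction task on the domain corpus, starting from the existing checkpoint's weights rather than from random initialization.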
| Comments: | To appear in Proceedings of the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026), June 9-12, 2026, Singapore |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.28538 [cs.CL] |
| | (or arXiv:2606.28538v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2606.28538 (arXiv-issued DOI via DataCite) |
Submission history
From: Dominik Stammbach
[v1] Fri, 26 Jun 2026 18:44:11 UTC (186 KB)
Originally published at arxiv.org