PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

arXiv cs.CL · Máté Metzger, Nadnapang Phophichit

Quick Take

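PaliBench introduces a multi-reference benchmark for Pali-to-English translation, together with a reusable method for building such benchmarks for other classical languages, so that model output is scored against several human translations rather than a single one.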

Key Points

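Develops a multi-reference benchmark for Pali-to-English translation: 1,700 passages, 8,389 segments, approximately 345,000 tokens.
Evaluates ten contemporary large language models with complementary metrics.
Methodology could be portable to other classical corpora where sufficient independent reference translations exist.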

[Submitted on 16 May 2026]

Abstract: Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.
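To make the idea of scoring against multiple human references concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not name the metrics used, so a simple token-overlap F1 stands in for them, and the function names (token_f1, multi_reference_score) and the example sentences are invented for illustration only.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a stand-in metric, assumed here)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> dict[str, float]:
    """Score one system output against several references.

    'best' rewards agreement with any one faithful rendering;
    'mean' reflects agreement with the references as a group.
    """
    if not references:
        raise ValueError("at least one reference translation is required")
    scores = [token_f1(hypothesis, ref) for ref in references]
    return {"best": max(scores), "mean": sum(scores) / len(scores)}


if __name__ == "__main__":
    # Invented placeholder sentences, not quotations from any published translation.
    references = [
        "the monk dwells in the forest",
        "the mendicant lives in the woods",
        "the bhikkhu stays in the forest",
    ]
    print(multi_reference_score("the monk lives in the forest", references))
```

Reporting both the best and the mean score is one possible design choice: the maximum avoids penalizing a system for siding with one translator's terminology, while the mean shows how far the output sits from the references overall.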

Comments:	Preprint. This manuscript has not yet been peer reviewed
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.16881 [cs.CL]
	(or arXiv:2605.16881v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16881 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Máté Metzger
[v1] Sat, 16 May 2026 08:43:01 UTC (41 KB)

Originally published at arxiv.org