Model-Based Quality Assessment for Massively Multilingual Parallel Data

arXiv cs.CL · Abdelaziz M. A. Ibrahim, Zihao Li, Jörg Tiedemann, Shaoxiong Ji


6/2/2026 · en

Quick Take

This study splits model-based assessment of multilingual parallel data into two components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). It benchmarks four embedding models on FLORES-200 and BOUQuET retrieval tasks and evaluates nine reference-free evaluators across 41,412 translation directions. No single model proves universally reliable, which suggests treating assessment as a direction-aware routing and calibration problem.

Key Points

Parallelism assessed using four embedding models on FLORES-200 and BOUQuET tasks.
Nine reference-free evaluators tested on 41,412 translation directions.
No model was universally reliable across translation directions.
Naive quality estimation ensembles dilute strong model signals.
Higher QE scores correlate with documented target-language coverage.
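
Parallelism assessment with embeddings is commonly framed as retrieval: embed the source and target sentences, then check whether each source sentence's nearest target is its aligned translation. The sketch below illustrates that idea with cosine similarity on toy vectors. The function names and the vectors are hypothetical, and the paper's actual models and scoring setup may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target sits at the same index."""
    hits = 0
    for i, src in enumerate(src_vecs):
        scores = [cosine(src, tgt) for tgt in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(src_vecs)


# Toy embeddings (invented for illustration); row i of `tgt` is meant to
# translate row i of `src`. The third target is deliberately a poor match.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.5, 0.6, 0.0]]

accuracy = retrieval_accuracy(src, tgt)
print(f"retrieval accuracy: {accuracy:.2f}")
```

In practice the vectors would come from a multilingual sentence encoder, and accuracy would be reported separately for each source-target direction rather than pooled.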

Article Excerpt

From source RSS / original summary

arXiv:2606.00285v1. Abstract: Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE).

For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source-target directions. Results show that no model is universally reliable across translation directions.

Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.
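
To make the contrast between a naive ensemble and direction-aware routing concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the model names (`qe_a`, `qe_b`), the reliability table, the threshold, and all numbers are invented and do not come from the paper. Calibration of the routed scores is not shown.

```python
from statistics import mean

# Hypothetical per-direction reliability of each QE model, e.g. agreement
# with human judgments on a held-out set. All values are invented.
RELIABILITY = {
    ("eng", "fra"): {"qe_a": 0.80, "qe_b": 0.75},
    ("eng", "wol"): {"qe_a": 0.10, "qe_b": 0.60},
}


def naive_ensemble(scores):
    """Unweighted mean over all QE models, ignoring the translation direction."""
    return mean(scores.values())


def routed_score(direction, scores, reliability=RELIABILITY, min_reliability=0.3):
    """Return the score of the most reliable model for this direction.

    Returns None when the direction is unknown or no model clears the
    reliability threshold, so the caller can abstain instead of guessing.
    """
    table = reliability.get(direction)
    if not table:
        return None
    best_model = max(table, key=table.get)
    if table[best_model] < min_reliability:
        return None
    return scores[best_model]


# One sentence pair scored by both hypothetical models.
scores = {"qe_a": 0.2, "qe_b": 0.9}
direction = ("eng", "wol")

averaged = naive_ensemble(scores)
routed = routed_score(direction, scores)
print(f"naive ensemble: {averaged:.2f}, routed: {routed:.2f}")
```

In this toy case the average pulls the score toward the model that is unreliable for the direction, while routing keeps the signal from the model that is reliable there. Returning None for unsupported directions is one possible design choice; a real pipeline might fall back to another signal instead.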
