Fluency and Faithfulness in Human and Machine Literary Translation
Quick Take
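A study of literary translations finds a negative correlation between fluency and faithfulness, suggesting a tradeoff. The pattern holds for human and Google Translate output but is weaker for TranslateGemma.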
Key Points
- Analyzed 130,486 translated paragraphs from 106 novels.
- Fluency measured with a translationese classifier; faithfulness with COMET-KIWI.
- Negative correlation found between fluency and faithfulness after controlling for paragraph length.
Abstract: Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.
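The abstract does not say how paragraph length is controlled for. One common approach, shown here purely as an assumption, is a partial correlation: regress each score on length, then correlate the residuals. The sketch below uses only the standard library. The names `pearson`, `residuals`, `partial_corr` and the toy numbers are hypothetical, invented to exercise the function, and are not the paper's data or code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, z):
    """Residuals of y after a least-squares fit on z (with intercept)."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - (my + slope * (a - mz)) for a, b in zip(z, y)]

def partial_corr(x, y, z):
    """Correlation of x and y once the linear effect of z is removed."""
    return pearson(residuals(x, z), residuals(y, z))

# Toy data (invented): both scores fall with paragraph length, while a
# length-independent component pushes them in opposite directions.
length = [10, 20, 30, 40, 50, 60]
noise = [1, -1, 1, -1, 1, -1]
fluency = [-0.1 * l + e for l, e in zip(length, noise)]
faithfulness = [-0.1 * l - e for l, e in zip(length, noise)]

raw = pearson(fluency, faithfulness)                  # length not controlled
partial = partial_corr(fluency, faithfulness, length)  # length controlled
print(f"raw={raw:.3f} partial={partial:.3f}")
```

In this toy setup the shared dependence on length makes the raw correlation positive, while the partial correlation is negative. That illustrates why a length control can change the picture, without implying anything about the magnitudes reported in the study.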
| Comments: | Accepted NLP4DH 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.15282 [cs.CL] (or arXiv:2605.15282v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15282 (arXiv-issued DOI via DataCite) |
Submission history
From: Sarah Griebel
[v1] Thu, 14 May 2026 18:00:34 UTC (1,971 KB)
Originally published at arxiv.org