Disentangling Language Roles in Multilingual LLM Task Execution

arXiv cs.CL·Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

·~2 min·5/28/2026·en·2

Quick Take

MTM-Bench evaluates multilingual LLMs by separating the instruction, content, and response languages of each task, and finds that the response-language role is the dominant axis of variation in task execution. Results across 20 frontier and open-weight LLMs suggest that mismatch count is not a monotonic predictor of difficulty, and that semantic correctness alone does not capture reliable multilingual task execution.

Key Points

MTM-Bench enumerates all 27 (instruction, content, response) language triplets across English, Spanish, and Chinese.
The authors evaluate 20 LLMs with decomposed metrics, including semantic correctness and target-language adherence.
Response-language role is the primary factor affecting task performance.
Mismatch count does not monotonically predict difficulty across models.
Distinct failure modes exist for different task families in multilingual execution.

Article Content

From source RSS / original summary

arXiv:2605.27649v1. Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet \((L_{\text{instr}}, L_{\text{content}}, L_{\text{resp}})\).

Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit.
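
To make the fully crossed design concrete, the following is a minimal sketch of how the 27 triplets arise from three languages in three roles. The language codes and variable names are illustrative, not taken from the paper. The per-triplet figure assumes that the 2,430 instances are spread evenly across triplets, which the abstract does not state.

```python
from itertools import product

# Illustrative language codes for English, Spanish, and Chinese.
LANGUAGES = ("en", "es", "zh")

# Each triplet is (instruction language, content language, response language).
# A fully crossed design takes every combination: 3 * 3 * 3 = 27.
triplets = list(product(LANGUAGES, repeat=3))

TOTAL_INSTANCES = 2430  # figure reported in the abstract

# ASSUMPTION: instances are distributed evenly across triplets.
instances_per_triplet = TOTAL_INSTANCES // len(triplets)

print(len(triplets))           # number of language triplets
print(instances_per_triplet)   # instances per triplet under the even-split assumption
```

Under that even-split assumption, each triplet would hold 90 instances; the actual allocation may differ.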

The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems.
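
One way to see why role matters more than a raw mismatch count is to group the 27 triplets by which slot deviates. The sketch below is an illustration only: "response-only" and "full-mismatch" are terms from the abstract, while the other pattern labels and the function name `mismatch_pattern` are hypothetical.

```python
from collections import Counter
from itertools import product

LANGUAGES = ("en", "es", "zh")  # illustrative codes

def mismatch_pattern(instr: str, content: str, resp: str) -> str:
    """Label a triplet by which language role differs from the other two."""
    if instr == content == resp:
        return "all-match"
    if instr == content:
        return "response-only"      # only the response slot differs
    if instr == resp:
        return "content-only"       # hypothetical label
    if content == resp:
        return "instruction-only"   # hypothetical label
    return "full-mismatch"          # all three languages differ

pattern_counts = Counter(
    mismatch_pattern(*triplet) for triplet in product(LANGUAGES, repeat=3)
)

for pattern, count in sorted(pattern_counts.items()):
    print(f"{pattern}: {count}")
```

With three languages this yields 3 all-match triplets and 6 in each of the other four groups. Comparing scores between the response-only and full-mismatch groups is the kind of contrast the abstract describes.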

Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
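
The gap between semantic correctness and reliable execution can be illustrated with a toy calculation. The abstract names the metrics but does not define them, so the sketch below assumes that joint success means every component check passes at once. The `Scores` class, the `max_contamination` threshold, and the four records are all made up for illustration and are not benchmark data.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    """Hypothetical per-response scores; field names are illustrative."""
    semantic_correct: bool
    target_language: bool
    constraints_met: bool
    contamination_ratio: float  # assumed: share of output outside the target language

def joint_success(s: Scores, max_contamination: float = 0.0) -> bool:
    # ASSUMPTION: joint success requires all components to pass together.
    return (
        s.semantic_correct
        and s.target_language
        and s.constraints_met
        and s.contamination_ratio <= max_contamination
    )

def rate(flags) -> float:
    flags = list(flags)
    return sum(flags) / len(flags)

# Made-up records, not results from the paper.
records = [
    Scores(True, True, True, 0.0),    # passes everything
    Scores(True, False, True, 0.4),   # right answer, wrong response language
    Scores(True, True, False, 0.0),   # right answer, constraint violated
    Scores(False, True, True, 0.0),   # right language, wrong answer
]

semantic_rate = rate(r.semantic_correct for r in records)
joint_rate = rate(joint_success(r) for r in records)
print(semantic_rate, joint_rate)
```

In this toy set the semantic-correctness rate is 0.75 while the joint rate is 0.25, which shows how a single aggregate semantic score can hide language-adherence and constraint failures.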

