Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

arXiv cs.CL · Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

5/28/2026

Quick Take

The study introduces Directed Acyclic Trace Graphs (DATG) to diagnose multilingual mathematical reasoning failures in the Qwen3 series, showing that language affects reasoning execution itself and not only comprehension of the problem statement. Non-English reasoning shows reduced anchor coverage and weaker dependency fidelity, particularly in low-resource languages. Two test-time controls, Loop-Retry and Formula-Retry, improve target-language reasoning performance in low-resource languages.

Key Points

DATG maps reasoning traces to language-independent mathematical anchors and dependencies.
Experiments cover the Qwen3 series across 12 languages; even when the problem is given in English, controlling the reasoning language can substantially reduce accuracy.
Non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity.
The proposed Loop-Retry and Formula-Retry test-time controls consistently improve target-language reasoning performance in low-resource languages.
The findings challenge the notion that language barriers solely stem from problem statement comprehension.

Article Content

From source RSS / original summary

arXiv:2605.27715v1. Abstract: Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements.

We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies.

This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages.
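
The abstract does not give formulas for these measurements, so the following is only a minimal sketch of how anchor coverage and dependency fidelity could be computed once a trace has been mapped to anchors. The `ReferenceDAG` type, the function names, the anchor labels, and the "first mention must precede" rule for edges are all assumptions made for illustration, not the paper's definitions. Detection of harmful mathematical actions is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceDAG:
    """Hypothetical reference graph: anchors are nodes, edges are (prerequisite, dependent)."""
    anchors: frozenset
    edges: frozenset


def anchor_coverage(ref: ReferenceDAG, trace_anchors: list) -> float:
    """Fraction of required anchors that appear anywhere in the trace."""
    if not ref.anchors:
        return 1.0
    covered = set(trace_anchors) & ref.anchors
    return len(covered) / len(ref.anchors)


def dependency_fidelity(ref: ReferenceDAG, trace_anchors: list) -> float:
    """Fraction of reference edges whose prerequisite is first mentioned before its dependent.

    An edge counts as violated if either endpoint is missing from the trace.
    """
    if not ref.edges:
        return 1.0
    first_seen = {}
    for position, anchor in enumerate(trace_anchors):
        first_seen.setdefault(anchor, position)
    respected = sum(
        1
        for pre, dep in ref.edges
        if pre in first_seen and dep in first_seen and first_seen[pre] < first_seen[dep]
    )
    return respected / len(ref.edges)


# Toy word problem: total = unit_price * quantity, then apply a discount.
ref = ReferenceDAG(
    anchors=frozenset({"unit_price", "quantity", "total_cost", "discount"}),
    edges=frozenset({
        ("unit_price", "total_cost"),
        ("quantity", "total_cost"),
        ("total_cost", "discount"),
    }),
)

# Anchors extracted from a target-language trace, in order of appearance.
trace = ["unit_price", "total_cost", "quantity"]

coverage = anchor_coverage(ref, trace)      # 3 of 4 anchors present
fidelity = dependency_fidelity(ref, trace)  # only unit_price -> total_cost is respected
```

In this toy trace the `discount` anchor never appears and `quantity` is mentioned only after `total_cost`, so coverage is 3/4 and one of three edges is respected. Because anchors are language-independent labels, the same two functions would apply to a trace in any reasoning language once the mapping step has been done.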

Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
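
The abstract names these controls but does not describe how they work. As a loosely related illustration only, here is a generic retry-on-repetition wrapper of the kind a test-time control could resemble. It is not the paper's Loop-Retry: the `has_loop` heuristic, the `min_repeats` threshold, and the `generate` callback are all hypothetical.

```python
from collections import Counter
from typing import Callable


def has_loop(trace: str, min_repeats: int = 3) -> bool:
    """Heuristic (assumed): the trace is 'looping' if any non-empty line repeats min_repeats times."""
    lines = [line.strip() for line in trace.splitlines() if line.strip()]
    counts = Counter(lines)
    return any(count >= min_repeats for count in counts.values())


def generate_with_retry(generate: Callable[[int], str], max_attempts: int = 3) -> str:
    """Call generate(attempt) until a trace without a detected loop is produced.

    Returns the last trace if every attempt loops. Assumes max_attempts >= 1.
    """
    trace = ""
    for attempt in range(max_attempts):
        trace = generate(attempt)
        if not has_loop(trace):
            return trace
    return trace


# Stand-in for a model call: the first attempt loops, the second does not.
def fake_generate(attempt: int) -> str:
    if attempt == 0:
        return "x = 2\nso x = 2\nso x = 2\nso x = 2"
    return "x = 2\ny = x + 3\ny = 5"


result = generate_with_retry(fake_generate)
```

Here `fake_generate` stands in for a real model call; in practice the callback would resample the reasoning trace in the target language.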
