ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

arXiv cs.CL·Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

6/5/2026

Quick Take

ReasoningFlow is a framework that captures the discourse structure of large reasoning model (LRM) reasoning traces as fine-grained directed acyclic graphs (DAGs). Applied to 1,260 traces from five models, it shows that LRMs produce structurally similar traces and exhibit diverse fine-grained reasoning behaviors. Two further findings: most erroneous steps are not used to derive the final answer, and mechanistic causal dependencies between steps do not reflect the language-level discourse structure.

Key Points

Developed and validated an annotation schema through manual annotation of 31 reasoning traces (2.1k steps), achieving high inter-annotator agreement.
Automated annotation scaled to 1,260 traces across math, science, and argumentation tasks.
Found that most erroneous steps in LRM traces are not used to derive the final answer.
ReasoningFlow reveals diverse reasoning behaviors like local verification and self-reflection.
Dataset and code available at https://github.com/jinulee-v/reasoningflow.

Article Content

From source RSS / original summary

arXiv:2606.05402v1 Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
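Finding (3) can be read as a reachability question on the trace graph: a step is "used" for the final answer only if a directed path leads from it to the answer node. The sketch below illustrates that idea on a toy graph. It is not the paper's code or schema: the step names, the edge list, the `erroneous` set, and the helper `ancestors_of` are all hypothetical, and the released repository may represent graphs differently.

```python
from collections import deque


def ancestors_of(target, edges):
    """Return every node that has a directed path to `target`.

    `edges` is a list of (source, destination) pairs, meaning the
    destination step depends on the source step.
    """
    # Map each node to the nodes it directly depends on.
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    # Breadth-first walk backwards from the target.
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical toy trace: s3 is a wrong attempt that gets abandoned.
edges = [
    ("s1", "s2"),      # s2 builds on s1
    ("s1", "s3"),      # s3 is an erroneous attempt starting from s1
    ("s3", "s4"),      # s4 checks s3 and the branch is dropped
    ("s2", "s5"),      # s5 continues from s2
    ("s5", "answer"),  # the final answer is derived from s5
]
erroneous = {"s3"}  # assumed error labels, for illustration only

used = ancestors_of("answer", edges)
unused_errors = erroneous - used
print(sorted(used))           # steps the answer depends on
print(sorted(unused_errors))  # erroneous steps the answer never uses
```

In this toy trace the erroneous step sits on an abandoned branch, so it never reaches the answer node; the abstract reports that this is the common case across the annotated LRM traces.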

