ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
Quick Answer
ReasoningFlow is a framework that captures the discourse structure of reasoning traces from large reasoning models (LRMs) as fine-grained directed acyclic graphs (DAGs). Applied to 1,260 traces from five models, it reveals shared structural patterns and diverse reasoning behaviors.
Quick Take
Key findings: LRMs trained from different base models produce structurally similar traces; most erroneous steps are not used to derive the final answer; and mechanistic causal dependencies between steps do not reflect the language-level discourse structure.
Key Points
- Developed an annotation schema and validated it through manual annotation of 31 reasoning traces (2.1k steps), with high inter-annotator agreement.
- Automated annotation scaled to 1,260 traces across math, science, and argumentation tasks.
- Found that most erroneous steps in LRM traces are not used to derive the final answer.
- ReasoningFlow reveals diverse reasoning behaviors like local verification and self-reflection.
- Dataset and code available at https://github.com/jinulee-v/reasoningflow.
Article Content
From source RSS / original summary: arXiv:2606.05402v1
Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
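Finding (3) can be read as a reachability question on the trace graph: an erroneous step is "used" only if some path leads from it to the final answer. The sketch below illustrates that idea on a toy graph. It is not the authors' code or data format. The edge direction (premise to conclusion), the step names, and the helpers `ancestors` and `unused_error_fraction` are all assumptions made for illustration, and the edge labels that a real ReasoningFlow graph would carry are left out.

```python
from collections import deque


def ancestors(edges, target):
    """Return every node that has a directed path to `target`.

    `edges` is an iterable of (source, destination) pairs. The
    premise -> conclusion direction is an assumption of this sketch.
    """
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def unused_error_fraction(edges, erroneous, final_answer):
    """Fraction of erroneous steps with no path to the final answer."""
    if not erroneous:
        return 0.0
    used = ancestors(edges, final_answer)
    unused = [step for step in erroneous if step not in used]
    return len(unused) / len(erroneous)


# Hypothetical toy trace: s2 -> s3 is an abandoned branch,
# while s1 -> s4 -> s5 -> answer is the path that yields the answer.
edges = [
    ("s1", "s2"),
    ("s2", "s3"),
    ("s1", "s4"),
    ("s4", "s5"),
    ("s5", "answer"),
]
erroneous = {"s2", "s4"}  # hypothetical error labels

fraction = unused_error_fraction(edges, erroneous, "answer")
print(fraction)
```

In this toy trace, `s2` sits on the abandoned branch and never reaches `answer`, while `s4` does, so half of the erroneous steps count as unused.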