Reasoning Models Don't Just Think Longer, They Move Differently
Quick Take
Reasoning-trained models follow distinct hidden-state trajectories during chain-of-thought generation, but the difference can only be interpreted after trajectory statistics are corrected for generation length.
Key Points
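- Longer generations mechanically alter raw trajectory statistics, so uncorrected difficulty comparisons are misleading.
- After length correction, the clearest reasoning-specific separation appears in code; the coupling is weaker but still present in mathematics and Boolean satisfiability.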
- Length adjustment is crucial for trajectory analysis.
Abstract: Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
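The length-correction step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact statistics or regression used, so everything here is an assumption for illustration: `directness` is taken to be end-to-end displacement divided by total path length, and `residualize` fits an ordinary least-squares line on log generation length and keeps the residuals. Both function names are hypothetical.

```python
import numpy as np


def directness(hidden_states: np.ndarray) -> float:
    """Assumed statistic: end-to-end displacement / total path length.

    hidden_states has shape (num_tokens, hidden_dim); a perfectly
    straight trajectory scores 1.0.
    """
    steps = np.diff(hidden_states, axis=0)
    path_length = np.linalg.norm(steps, axis=1).sum()
    if path_length == 0.0:
        return 0.0  # degenerate trajectory: no movement at all
    displacement = np.linalg.norm(hidden_states[-1] - hidden_states[0])
    return float(displacement / path_length)


def residualize(stat: np.ndarray, length: np.ndarray) -> np.ndarray:
    """Remove the part of `stat` that is linearly predictable from log(length).

    Assumption: a linear fit on log length; the paper may use another form.
    """
    design = np.column_stack([np.ones(len(length)), np.log(length)])
    coef, *_ = np.linalg.lstsq(design, stat, rcond=None)
    return stat - design @ coef
```

Under these assumptions, one would compute `directness` per generation, pass the resulting array together with the token counts to `residualize`, and only then relate the residuals to problem difficulty. By construction the residuals are uncorrelated with log length, which is what makes the later difficulty comparison meaningful.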
| Comments: | Preprint |
| Subjects: | Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML) |
| Cite as: | arXiv:2605.15454 [cs.CL] (or arXiv:2605.15454v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15454 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Anders Gjølbye
[v1] Thu, 14 May 2026 22:37:33 UTC (653 KB)
— Originally published at arxiv.org