Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL·Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo

5/18/2026

Quick Take

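This study finds that reasoning-trained language models follow distinct hidden-state trajectories during chain-of-thought generation. The separation is clearest in competitive programming, where harder problems show more direct length-corrected trajectories than in matched instruction-tuned baselines. Because generation length mechanically alters raw path statistics, difficulty comparisons are misleading unless trajectory statistics are first adjusted for length. After that adjustment, difficulty remains coupled to trajectory geometry in code, mathematics, and Boolean satisfiability.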

Key Points

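Longer chains of thought do not by themselves show whether a model is computing for more steps or following a different internal trajectory.
Reasoning-trained models show distinct length-corrected trajectory geometry, most clearly in competitive programming, where harder problems yield more direct trajectories and less heterogeneous local curvature than in matched instruction-tuned baselines.
Difficulty remains coupled to corrected trajectory geometry across competitive programming, mathematics, and Boolean satisfiability, though the coupling is weaker in the latter two domains.
Behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring.
Length correction is a prerequisite for generation-time trajectory analysis, because longer generations mechanically alter path statistics.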
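
To make "direct trajectories" and "heterogeneous local curvature" concrete, here is a minimal sketch of two path statistics computed on a sequence of hidden-state vectors. The definitions are assumptions chosen for illustration (directness as net displacement over path length, curvature heterogeneity as the spread of turning angles between consecutive steps); the summary does not give the paper's exact formulas, and the function names are hypothetical.

```python
import math


def _sub(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def directness(states):
    """Net displacement divided by total path length (assumed definition).

    A value of 1.0 means the trajectory is a straight line; values near 0
    mean the path wanders and ends close to where it started.
    """
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    path_length = sum(_norm(s) for s in steps)
    if path_length == 0.0:
        return 0.0
    return _norm(_sub(states[-1], states[0])) / path_length


def turning_angles(states):
    """Angles in radians between consecutive steps, a simple curvature proxy."""
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0.0 or nv == 0.0:
            continue  # a zero-length step has no direction
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles


def curvature_heterogeneity(states):
    """Population standard deviation of turning angles (assumed definition)."""
    angles = turning_angles(states)
    if len(angles) < 2:
        return 0.0
    mean = sum(angles) / len(angles)
    return math.sqrt(sum((a - mean) ** 2 for a in angles) / len(angles))
```

In practice `states` would be one hidden-state vector per generated token taken from a chosen layer; here any list of equal-length numeric lists works. Both statistics change mechanically with the number of steps, which is why they need a length adjustment before being compared across problems.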

[Submitted on 14 May 2026]

Abstract: Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
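
The length-correction step in the abstract can be illustrated with a small sketch. It assumes the simplest form of residualization: fit an ordinary least-squares line of a trajectory statistic on generation length, keep the residuals, and then correlate those residuals with difficulty. Whether the paper uses raw length, log length, or additional covariates is not stated here, so the choice of predictor and the function names `residualize` and `pearson` are assumptions.

```python
import math


def residualize(values, lengths):
    """Return values minus their least-squares linear fit on lengths.

    Assumption: a single linear predictor (generation length) with intercept.
    """
    n = len(values)
    mean_x = sum(lengths) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    if sxx == 0.0:
        # All generations have the same length: only the mean can be removed.
        return [y - mean_y for y in values]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(lengths, values)]


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

A typical use would be `pearson(residualize(stat, lengths), difficulty)`, where `stat` holds one trajectory statistic per problem. If a statistic is a pure function of length, its residuals vanish and no coupling with difficulty can survive; any correlation that remains after residualization reflects variation that length alone does not explain.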

Comments:	Preprint
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2605.15454 [cs.CL]
	(or arXiv:2605.15454v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.15454 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Anders Gjølbye
[v1] Thu, 14 May 2026 22:37:33 UTC (653 KB)

— Originally published at arxiv.org
