Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
Quick Take
Cognitive cost alignment between humans and large reasoning models is a training-time phenomenon, unaffected by inference effort.
Key Points
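- Within-task and cross-task alignment stays invariant across three effort levels and six reasoning tasks.
- The effort parameter sets an upper budget on generation rather than driving real-time allocation.
- Token allocation tracks fine-grained, format-dependent human difficulty patterns in arithmetic, and model scale improves the match.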
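One way to read the "ceiling, not dial" distinction is as two toy models of how an effort setting could affect token counts. The sketch below is purely illustrative: the function names, the per-item demand values, and the budget and scale numbers are all hypothetical and are not taken from the paper or from any GPT-OSS API.

```python
def tokens_under_ceiling(demand, budget):
    """Ceiling account: each item gets what it demands, capped by the budget."""
    return [min(d, budget) for d in demand]


def tokens_under_dial(demand, scale):
    """Dial account: the effort setting rescales spending on every item."""
    return [d * scale for d in demand]


# Hypothetical per-item token demand fixed by the trained policy.
demand = [120, 300, 450, 900]

# Hypothetical budgets; all of them sit above the largest demand.
budgets = {"low": 1000, "medium": 4000, "high": 16000}
ceiling = {level: tokens_under_ceiling(demand, b) for level, b in budgets.items()}

# Hypothetical multipliers for the competing dial account.
scales = {"low": 0.5, "medium": 1.0, "high": 2.0}
dial = {level: tokens_under_dial(demand, s) for level, s in scales.items()}

# A budget below some demands only truncates the longest items.
tight = tokens_under_ceiling(demand, 400)

for level in budgets:
    print(level, ceiling[level], dial[level])
print("tight budget:", tight)
```

Under the ceiling model, token counts are identical across effort levels whenever the budget exceeds every item's demand, whereas the dial model changes them at every level. This sketch only concerns token counts; it makes no claim about how the paper measured alignment.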
Abstract: Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.
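As a rough illustration of what "alignment" between trace length and human reaction time could look like numerically, the sketch below computes a Pearson correlation per effort level. The choice of Pearson correlation is an assumption, since the abstract does not name the alignment metric, and every number in the example is made up for illustration rather than drawn from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical mean human reaction times (ms) for five items of one task.
human_rt_ms = [650.0, 720.0, 910.0, 1340.0, 1800.0]

# Hypothetical chain-of-thought token counts for the same items,
# one list per effort level (illustrative values only).
tokens_by_effort = {
    "low": [110, 150, 210, 380, 520],
    "medium": [115, 148, 220, 372, 531],
    "high": [112, 155, 214, 390, 525],
}

# One alignment score per effort level.
alignment = {
    level: pearson_r(human_rt_ms, tokens)
    for level, tokens in tokens_by_effort.items()
}

for level, r in alignment.items():
    print(f"{level}: r = {r:.3f}")
```

Deciding whether such scores are genuinely invariant across effort levels requires a formal comparison such as the Bayes Factor analysis the abstract mentions, which this sketch does not attempt.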
| Comments: | 8 pages, 6 figures |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC) |
| Cite as: | arXiv:2605.16938 [cs.CL] (or arXiv:2605.16938v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.16938 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Tianhong Wang
[v1] Sat, 16 May 2026 11:20:01 UTC (382 KB)
Originally published at arxiv.org