Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

arXiv cs.CL·Yueqing Hu, Tianhong Wang

5/19/2026

Quick Answer

This paper shows that, in Large Reasoning Models (LRMs) such as GPT-OSS-20B and GPT-OSS-120B, cognitive cost alignment with human reasoning remains stable across inference-time effort levels, indicating that the alignment is a product of training rather than of real-time adjustment.

Quick Take

Across GPT-OSS-20B and GPT-OSS-120B, the alignment between model reasoning cost and human cognitive difficulty stays stable when inference-time effort is varied, which points to training rather than real-time adjustment as its source. The study suggests that the models' tracking of human difficulty improves with scale, while the allocation policy itself is fixed after training.

Key Points

Cognitive cost alignment in LRMs is invariant across three effort levels and six tasks.
Bayes Factors lean toward the null, giving no evidence that inference-time effort changes alignment.
Token allocation tracks fine-grained, format-dependent human difficulty patterns in arithmetic, and the match improves with model scale.
The study supports a compiled account of LRM problem-solving rather than an online one.
The effort parameter sets an upper budget on generation rather than driving real-time allocation.
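
The ceiling-versus-dial distinction in the last point can be illustrated with a toy model. The sketch below is not the paper's analysis: the reaction times, token demands, and effort caps are invented, and `apply_ceiling` is a hypothetical helper. It assumes a per-item token demand fixed at training time and treats the effort setting purely as a cap on that demand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Invented numbers for illustration only; none come from the paper.
human_rt = [1.2, 1.8, 2.5, 3.1, 4.0, 5.2]        # seconds per item
policy_tokens = [120, 190, 240, 330, 410, 500]   # demand fixed after training
effort_caps = {"low": 600, "medium": 1200, "high": 2400}  # hypothetical budgets

def apply_ceiling(demand, cap):
    """Effort as a ceiling: truncate at the cap, never stretch up to it."""
    return [min(tokens, cap) for tokens in demand]

alignment = {}
mean_tokens = {}
for level, cap in effort_caps.items():
    trace_lengths = apply_ceiling(policy_tokens, cap)
    alignment[level] = pearson(trace_lengths, human_rt)
    mean_tokens[level] = sum(trace_lengths) / len(trace_lengths)

for level in effort_caps:
    print(level, round(alignment[level], 3), round(mean_tokens[level], 1))
```

Because every cap in this toy setup sits above the largest demand, the three effort levels produce the same trace lengths and therefore the same correlation with reaction time. Only a cap low enough to truncate some items would change either quantity.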

[Submitted on 16 May 2026]

Abstract: Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.
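
To make "Bayes Factors lean toward the null" concrete, here is a minimal sketch of one common way to quantify evidence for no effect: the BIC approximation to the Bayes factor. The abstract does not say how the paper computed its Bayes Factors, so this is an assumption rather than the authors' method. The alignment scores are invented placeholders, and the sketch ignores the repeated-measures structure (the same six tasks appearing at every effort level).

```python
from math import exp, log

# Invented alignment scores (e.g. correlations), one per task, for each
# effort level. These are placeholders, not results from the paper.
scores = {
    "low":    [0.41, 0.55, 0.38, 0.62, 0.47, 0.51],
    "medium": [0.43, 0.53, 0.40, 0.60, 0.46, 0.52],
    "high":   [0.42, 0.56, 0.37, 0.61, 0.48, 0.52],
}

def mean(values):
    return sum(values) / len(values)

def sse(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((v - center) ** 2 for v in values)

def gaussian_bic(sse_value, n_obs, n_params):
    """BIC for a Gaussian model, up to a constant shared by both models."""
    return n_obs * log(sse_value / n_obs) + n_params * log(n_obs)

all_scores = [v for group in scores.values() for v in group]
n = len(all_scores)

# Null model: one shared mean across effort levels (1 parameter).
sse_null = sse(all_scores, mean(all_scores))
# Effort model: a separate mean per effort level (3 parameters).
sse_effort = sum(sse(group, mean(group)) for group in scores.values())

bic_null = gaussian_bic(sse_null, n, 1)
bic_effort = gaussian_bic(sse_effort, n, len(scores))

# BF01 > 1 favors the null (no effect of effort) under this approximation.
bf01 = exp((bic_effort - bic_null) / 2)
print(round(bf01, 2))
```

When the per-level means are nearly identical, the extra parameters of the effort model buy almost no reduction in error, so the BIC penalty dominates and `bf01` comes out above 1.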

Comments:	8 pages, 6 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Cite as:	arXiv:2605.16938 [cs.CL]
	(or arXiv:2605.16938v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.16938 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tianhong Wang
[v1] Sat, 16 May 2026 11:20:01 UTC (382 KB)

— Originally published at arxiv.org


