HRM-Text: Efficient Pretraining Beyond Scaling

arXiv cs.CL·Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

5/21/2026


Quick Answer

HRM-Text introduces a Hierarchical Recurrent Model (HRM) for efficient pretraining: a 1B-parameter model trained from scratch on only 40 billion unique tokens with a $1,500 budget performs competitively with 2-7B-parameter open models.

Quick Take

HRM-Text swaps the standard Transformer for a Hierarchical Recurrent Model (HRM) and trains exclusively on instruction-response pairs rather than raw text. A 1B-parameter model trained from scratch on 40 billion unique tokens with a $1,500 budget scores 60.7% on MMLU and performs competitively with 2-7B-parameter open models, while using an estimated 96-432x less compute than standard baselines.
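The paper's abstract pairs the instruction-response data with PrefixLM masking. As a rough illustration of what such a mask looks like, here is a minimal sketch in plain Python. The function name `prefix_lm_masks` is hypothetical, and restricting the loss to response tokens is an assumption about what the "task-completion objective" means; the abstract does not spell this out.

```python
def prefix_lm_masks(prefix_len, total_len):
    """Build PrefixLM masks for one instruction-response pair.

    attn[i][j] is True when position i may attend to position j.
    loss[i] is True when token i belongs to the response.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    attn = [
        [
            # Causal rule: every position sees itself and earlier positions.
            (j <= i)
            # Prefix rule: instruction tokens also see the whole instruction.
            or (i < prefix_len and j < prefix_len)
            for j in range(total_len)
        ]
        for i in range(total_len)
    ]
    # ASSUMPTION: only response tokens contribute to the training loss.
    loss = [i >= prefix_len for i in range(total_len)]
    return attn, loss


# Two instruction tokens followed by two response tokens.
attn, loss = prefix_lm_masks(prefix_len=2, total_len=4)
for row in attn:
    print("".join("1" if allowed else "0" for allowed in row))
# 1100
# 1100
# 1110
# 1111
```

The first two rows show the instruction attending to itself in both directions, while the last two rows stay causal, so a response token never sees later response tokens.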

Key Points

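Replaces the standard Transformer with a Hierarchical Recurrent Model (HRM) that splits computation into slow-evolving strategic layers and fast-evolving execution layers.
A 1B-parameter model reaches 60.7% on MMLU after training from scratch on only 40 billion unique tokens.
Trained on a $1,500 budget with roughly 100-900x fewer training tokens than standard baselines, it performs competitively with 2-7B-parameter open models.
Introduces MagicNorm and warmup deep credit assignment to stabilize deep recurrence for language modeling.
Argues that co-designing architectures and training objectives can sharply reduce the compute-to-performance ratio.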
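To put the token figures above in perspective, the short calculation below works out what the stated ratios imply. It is only back-of-the-envelope arithmetic: it assumes the 100-900x ratio applies directly to the 40 billion token figure and that the whole $1,500 budget is attributed to that single training run. Neither assumption is confirmed by the summary.

```python
HRM_TOKENS = 40_000_000_000    # 40 billion unique tokens (from the abstract)
TOKEN_RATIO_RANGE = (100, 900)  # "100-900x fewer training tokens"
BUDGET_USD = 1_500              # stated training budget


def implied_baseline_tokens(hrm_tokens, ratio):
    """Tokens a baseline would use if it needs `ratio` times more data."""
    return hrm_tokens * ratio


low, high = (implied_baseline_tokens(HRM_TOKENS, r) for r in TOKEN_RATIO_RANGE)

# ASSUMPTION: the entire budget is spent on this one training run.
usd_per_billion = BUDGET_USD / (HRM_TOKENS / 1e9)

print(f"Implied baseline corpus: {low / 1e12:.0f}T to {high / 1e12:.0f}T tokens")
print(f"Budget per billion tokens: ${usd_per_billion:.2f}")
# Implied baseline corpus: 4T to 36T tokens
# Budget per billion tokens: $37.50
```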


[Submitted on 20 May 2026]


Abstract: The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
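The idea of decoupling computation into slow-evolving and fast-evolving parts can be pictured with a toy two-timescale recurrence. The sketch below is not the paper's architecture: the tanh update rule, the weight names, the step counts, and the function `two_timescale_forward` are all hypothetical, and MagicNorm and warmup deep credit assignment are left out because the abstract does not define them. It only shows the nesting, where a fast state is refined several times for every single update of a slow state.

```python
import math
import random


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def tanh_sum(*vectors):
    """Add vectors elementwise, then squash each entry with tanh."""
    return [math.tanh(sum(parts)) for parts in zip(*vectors)]


def two_timescale_forward(x, weights, slow_steps=2, fast_steps_per_slow=3):
    """Toy recurrence: the fast state updates often, the slow state rarely."""
    dim = len(x)
    fast = [0.0] * dim
    slow = [0.0] * dim
    fast_updates = slow_updates = 0
    for _ in range(slow_steps):
        for _ in range(fast_steps_per_slow):
            # Fast ("execution") update, conditioned on the current slow state.
            fast = tanh_sum(
                matvec(weights["fx"], x),
                matvec(weights["ff"], fast),
                matvec(weights["fs"], slow),
            )
            fast_updates += 1
        # Slow ("strategic") update, once per block of fast steps.
        slow = tanh_sum(matvec(weights["ss"], slow), matvec(weights["sf"], fast))
        slow_updates += 1
    return fast, slow, (fast_updates, slow_updates)


random.seed(0)
DIM = 4


def random_matrix(dim):
    return [[random.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]


# Hypothetical weight names: "fs" maps slow state into the fast update, etc.
weights = {name: random_matrix(DIM) for name in ("fx", "ff", "fs", "ss", "sf")}
x = [random.gauss(0.0, 1.0) for _ in range(DIM)]

fast, slow, counts = two_timescale_forward(x, weights)
print(counts)  # (6, 2): six fast updates for two slow updates
```

Because the same weights are reused at every step, the effective depth grows with the number of steps rather than with the parameter count, which is one plausible reading of why a recurrent design could be parameter-efficient.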

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.20613 [cs.CL]
	(or arXiv:2605.20613v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.20613 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuhao Sun
[v1] Wed, 20 May 2026 01:59:50 UTC (2,349 KB)

— Originally published at arxiv.org


