Pretraining Language Models on Historical Text

arXiv cs.CL·Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu


6/3/2026

Quick Take

TypewriterLM is a 7.24B-parameter language model trained exclusively on English text predating 1913, built to address the data quality and temporal consistency challenges of historical language modeling. It is trained on TypewriterCorpus, a 54B-token historical dataset, and is post-trained with lexically grounded instruction tuning, a framework that constrains responses to remain directly grounded in historical source documents. The model and all associated resources are released to support further research on historical language models.

Key Points

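TypewriterLM is a 7.24-billion-parameter model trained exclusively on English text predating 1913.
TypewriterCorpus contains 54 billion tokens collected from diverse archival and linguistically annotated sources, with data cleaning and leakage mitigation.
Lexically grounded instruction tuning constrains responses to remain grounded in historical source documents, and is used to build two datasets: History-LIMA and History-SelfInstruct.
The History-Event benchmark suite evaluates competence, temporal grounding, and data leakage.
The model and all associated resources are released to support future research on historical language models.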

Article Excerpt

From source RSS / original summary

arXiv:2606.02991v1. Abstract: We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations.

To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct.
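The abstract does not say how lexical grounding is measured or enforced, so the following is only a minimal sketch of one plausible reading: score a candidate response by the fraction of its word tokens that also occur in the source document, and keep it only above a threshold. The function names (`tokenize`, `grounding_score`, `is_grounded`), the simple regex tokenizer, the 0.8 threshold, and the example sentences are all assumptions made for illustration, not details from the paper.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (deliberately simple)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def grounding_score(response: str, source: str) -> float:
    """Return the fraction of response tokens that also occur in the source."""
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0  # an empty response carries no grounded content
    source_vocab = set(tokenize(source))
    hits = sum(1 for tok in response_tokens if tok in source_vocab)
    return hits / len(response_tokens)


def is_grounded(response: str, source: str, threshold: float = 0.8) -> bool:
    """Hypothetical filter: accept a response whose score meets the threshold."""
    return grounding_score(response, source) >= threshold


# Illustrative (invented) source passage and two candidate responses.
source = (
    "The steamer left Liverpool on the tenth of March "
    "and reached New York in nine days."
)
grounded = "The steamer reached New York in nine days."
ungrounded = "The airplane landed at the airport after a short flight."

print(grounding_score(grounded, source))
print(grounding_score(ungrounded, source))
```

A token-overlap filter like this is crude: it ignores word order and paraphrase, and it would wrongly accept a response that rearranges source words into a false claim. It is shown only to make the idea of "remaining grounded in the source document" concrete.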

To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

