Pretraining Language Models on Historical Text
Quick Take
TypewriterLM is a 7.24B-parameter language model trained exclusively on English text predating 1913, built to address the data quality and temporal consistency challenges of historical language modeling. It is trained on TypewriterCorpus, a 54B-token historical dataset, and introduces lexically grounded instruction tuning, a post-training framework that keeps responses grounded in historical source documents. The model and its associated resources are released to support further research on historical language models.
Key Points
- TypewriterLM: 7.24 billion parameters, trained exclusively on English text predating 1913.
- TypewriterCorpus consists of 54 billion tokens from diverse archival sources.
- Lexically grounded instruction tuning introduced to keep responses grounded in historical source documents.
- History-Event benchmark suite evaluates competence, temporal grounding, and data leakage.
- All resources released to support future research on historical language models.
Article Excerpt
From the original arXiv summary (arXiv:2606.02991v1).
Abstract: We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations.
To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct.
To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
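The abstract does not say how lexical grounding is enforced, so the following is only an illustrative sketch of one way such a constraint could be checked: a token-overlap heuristic that measures how much of a response's vocabulary appears in its source document. The function names, the crude tokenizer, and the 0.8 threshold are all assumptions made for this example and are not taken from the paper.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def grounding_score(response: str, source: str) -> float:
    """Return the fraction of response tokens that also occur in the source document.

    Hypothetical heuristic: this is NOT the paper's method, only a simple
    stand-in for the idea of keeping responses lexically close to a source.
    """
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    source_vocab = set(tokenize(source))
    hits = sum(1 for token in response_tokens if token in source_vocab)
    return hits / len(response_tokens)


def is_grounded(response: str, source: str, threshold: float = 0.8) -> bool:
    """Accept a response only if its overlap with the source meets the (assumed) threshold."""
    return grounding_score(response, source) >= threshold
```

Under these assumptions, a candidate instruction-response pair would be kept only when `is_grounded(response, source)` returns `True`. A real pipeline would likely need something stricter than bag-of-words overlap, for example matching longer spans, since common function words inflate the score.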