Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems
Quick Answer
A computational analysis of Tang-dynasty poetry finds that a poet's geographic origin leaves a linguistic fingerprint: models built on character n-gram TF-IDF predict a poet's broad region (South vs. North) at 0.69 accuracy.
Quick Take
A computational analysis of Tang-dynasty poetry reveals geographic origins through linguistic fingerprints, achieving 0.69 accuracy in predicting poet regions using character n-gram TF-IDF. The study highlights a distance-decay effect in poetic language and temporal variations in regional separability, suggesting interpretable machine learning can generate hypotheses for literary history.
Key Points
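- Models predict a poet's broad region (South vs. North) at 0.69 accuracy, above the 0.53 majority baseline.
- Linguistic distance between circuits grows with geographic distance (Mantel r=0.40, p≈0.09 over nine circuits).
- South/North separability varies over time: at chance in the High Tang, strongest in the Late Tang.
- In the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom.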
- GuwenBERT transformer matches TF-IDF but does not outperform it.
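To make the first key point concrete, here is a minimal sketch of character n-gram TF-IDF classification compared against a majority baseline. It is an illustration under stated assumptions, not the paper's pipeline: the nearest-centroid classifier is a simple stand-in (the summary does not say which classical models were used), the function names are hypothetical, and the short strings are invented toy text rather than poems from the corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=2):
    """Yield every character n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def fit_idf(docs):
    """Smoothed inverse document frequency for each n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalised sparse TF-IDF vector; unseen n-grams are ignored."""
    tf = Counter(g for g in char_ngrams(doc) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def fit_centroids(docs, labels, idf):
    """Mean TF-IDF vector per label (assumed stand-in classifier)."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for g, v in tfidf_vector(doc, idf).items():
            sums[label][g] += v
    return {lab: {g: v / counts[lab] for g, v in vec.items()}
            for lab, vec in sums.items()}


def predict(doc, idf, centroids):
    """Return the label whose centroid has the largest dot product with doc."""
    vec = tfidf_vector(doc, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(g, 0.0) for g, v in vec.items())

    return max(centroids, key=score)


def majority_baseline(labels):
    """Accuracy obtained by always guessing the most frequent label."""
    return max(Counter(labels).values()) / len(labels)


# Invented toy "poet documents" (NOT corpus data), one string per poet.
train_docs = ["江水春花江水", "春花江南水", "塞雪風沙塞", "風沙關塞雪"]
train_labels = ["South", "South", "North", "North"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("江南春水", idf, centroids))
print(majority_baseline(train_labels))
```

The majority baseline is the reference point for the 0.69 vs. 0.53 comparison: a model is only informative to the extent that it beats always guessing the larger class.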
Article Content
From source RSS / original summary
arXiv:2606.24093v1 Announce Type: new
Abstract: We ask whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their work. Aggregating every poem attributed to each author in the Complete Tang Poems (Quan Tang Shi) and linking poets to their administrative circuit of origin via the China Biographical Database (CBDB), we build a poet-level corpus of 357 poets across the ten Tang circuits and frame origin prediction as multi-class classification.
Using character $n$-gram TF-IDF together with interpretable domain features (imagery, season, and allusion), classical and neural models predict a poet's broad region (South vs. North) at $0.69$ accuracy, well above the $0.53$ majority baseline, and finer circuit-level origin above chance.
Beyond classification, three findings emerge. (i) Linguistic distance between circuits grows with geographic distance (Mantel $r=0.40$, $p\approx0.09$ over nine circuits), evidence of a distance-decay effect in poetic language. (ii) The signal interacts with time: South/North separability is at chance in the High Tang and strongest in the Late Tang, consistent with court-driven homogenization at the empire's height followed by regional divergence. (iii) The model's confident errors are historically meaningful: in the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom.
We further show that, when given the whole corpus through a hierarchical frozen-encoder representation, a classical-Chinese transformer (GuwenBERT) only matches -- not beats -- simple TF-IDF, and that combining them adds nothing, indicating that character $n$-grams already capture the regional signal. Our results position interpretable machine learning as a hypothesis generator for literary history.
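The distance-decay finding rests on a Mantel test, which correlates two distance matrices and estimates significance by jointly permuting the rows and columns of one of them. The sketch below shows the general technique in plain Python. It makes several assumptions the abstract does not confirm: Pearson correlation as the statistic, a one-sided p-value, 999 permutations, and hypothetical function names. The matrices are invented toy values, not the paper's circuit distances.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(d_geo, d_ling, permutations=999, seed=0):
    """Return (r, p): observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(d_geo)
    geo = upper_triangle(d_geo)
    r_obs = pearson(geo, upper_triangle(d_ling))
    hits = 0
    idx = list(range(n))
    for _ in range(permutations):
        rng.shuffle(idx)
        # Relabel the items: permute rows and columns of d_ling together.
        permuted = [[d_ling[idx[i]][idx[j]] for j in range(n)]
                    for i in range(n)]
        # Small tolerance guards against floating-point ties.
        if pearson(geo, upper_triangle(permuted)) >= r_obs - 1e-12:
            hits += 1
    # Count the observed arrangement as one of the permutations.
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value


# Invented toy example: four sites on a line, with a "linguistic" distance
# that is an exact linear function of geographic distance.
positions = [0, 1, 2, 4]
d_geo = [[abs(a - b) for b in positions] for a in positions]
d_ling = [[0 if i == j else 2 * d_geo[i][j] + 1 for j in range(4)]
          for i in range(4)]

r, p = mantel(d_geo, d_ling)
print(round(r, 3), p)
```

Permuting whole rows and columns, rather than shuffling individual matrix cells, preserves the dependence among distances that share an item, which is why the Mantel test is preferred over an ordinary correlation test on the flattened matrices.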