Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

arXiv cs.CL·Chi-Sheng Chen, Hung-Yun Liu

4h ago

·~2 min·6/24/2026·en·0

Quick Answer

This paper shows that A computational analysis of Tang-dynasty poetry reveals geographic origins through linguistic fingerprints, achieving 0.69 accuracy in predicting poet regions using character n-gram TF-IDF.

Quick Take

A computational analysis of Tang-dynasty poetry reveals geographic origins through linguistic fingerprints, achieving 0.69 accuracy in predicting poet regions using character n-gram TF-IDF. The study highlights a distance-decay effect in poetic language and temporal variations in regional separability, suggesting interpretable machine learning can generate hypotheses for literary history.

Key Points

Model predicts poet origins with 0.69 accuracy, surpassing the 0.53 baseline.
Linguistic distance correlates with geographic distance (Mantel r=0.40).
South/North separability varies over time, strongest in Late Tang.
Misclassifications reflect historical prestige of northern court idiom.
GuwenBERT transformer matches TF-IDF but does not outperform it.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 24093v1 Announce Type: new Abstract: We ask whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their work. Aggregating every poem attributed to each author in the Complete Tang Poems (Quan Tang Shi) and linking poets to their administrative circuit of origin via the China Biographical Database (CBDB), we build a poet-level corpus of 357 poets across the ten Tang circuits and frame origin prediction as multi-class classification.

Using character $n$-gram TF-IDF together with interpretable domain features (imagery, season, and allusion), classical and neural models predict a poet's broad region (South vs. \ North) at $0. 69$ accuracy, well above the $0. 53$ majority baseline, and finer circuit-level origin above chance. Beyond classification, three findings emerge. (i) Linguistic distance between circuits grows with geographic distance (Mantel $r=0. 40$, $p\approx0.

09$ over nine circuits), evidence of a distance-decay effect in poetic language. (ii) The signal interacts with time: South/North separability is at chance in the High Tang and strongest in the Late Tang, consistent with court-driven homogenization at the empire's height followed by regional divergence. (iii) The model's confident errors are historically meaningful -- in the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom.

We further show that, when given the whole corpus through a hierarchical frozen-encoder representation, a classical-Chinese transformer (GuwenBERT) only matches -- not beats -- simple TF-IDF, and that combining them adds nothing, indicating that character $n$-grams already capture the regional signal. Our results position interpretable machine learning as a hypothesis generator for literary history.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Barak Or

4h ago

FeaturedOriginal

Quantifying Prior Dominance in Systems

AI Summary

The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.

#LLM #AI Coding #Inference #AI Startup

Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CL

Quantifying Prior Dominance in RAG Systems

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

Quantifying Prior Dominance in Systems