A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

arXiv cs.CL·George Mikros, Fotios Fitsilis

5/25/2026


Quick Answer

This paper presents a reproducible NLP pipeline for Katharevousa Greek parliamentary text that integrates OCR reconstruction, LLM-assisted annotation, and model comparison.

Quick Take

A new reproducible NLP pipeline for Katharevousa Greek parliamentary text integrates OCR reconstruction, LLM-assisted annotation, and model comparison. The best parser, an XLM-R model, achieved 0.8893 UPOS accuracy and 0.5162 LAS, an absolute gain of 0.0980 LAS over spaCy Greek, the strongest off-the-shelf baseline. The work provides reusable syntactic infrastructure for historical texts.

Key Points

Pipeline includes OCR reconstruction, LLM-assisted annotation, and model-family comparison.
The frozen reference set holds 1,697 automatically validated sentences: 1,357 for training and 340 held out for testing.
XLM-R model achieved 0.8893 UPOS accuracy and 0.5162 LAS.
spaCy Greek reached 0.4183 LAS, indicating substantial register mismatch.
The entire pipeline (code, schema, reference annotations, fixed split, and benchmark reports) is released open-access for further research.


Article Content

From source RSS / original summary

arXiv:2605.22978v1 (Announce Type: new)

Abstract: Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period.

The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol.
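One way a fixed, reproducible train/test split like the one described above could be produced is sketched below. This is an illustration, not the authors' released code: the function name `fixed_split`, the salt string, the synthetic sentence IDs, and the hash-ordering scheme are all assumptions made for the example. Only the counts (1,697 sentences, 340 held out) come from the abstract.

```python
import hashlib


def fixed_split(sent_ids, n_test, salt="split-v1"):
    """Deterministically split sentence IDs into (train, test).

    Hypothetical scheme: order IDs by a salted SHA-256 digest and hold out
    the first n_test. The result does not depend on input order or on any
    random seed, so the split can be frozen and regenerated exactly.
    """
    def key(sid):
        return hashlib.sha256(f"{salt}:{sid}".encode("utf-8")).hexdigest()

    ordered = sorted(set(sent_ids), key=key)
    test = sorted(ordered[:n_test])
    train = sorted(ordered[n_test:])
    return train, test


# Synthetic IDs standing in for the 1,697 reference sentences.
ids = [f"sent-{i:04d}" for i in range(1697)]
train, test = fixed_split(ids, n_test=340)
print(len(train), len(test))  # 1357 340
```

Ordering by a content-independent hash rather than shuffling with a seeded random generator keeps the assignment stable across library versions and platforms, which matters when every model family must be scored on the same held-out sentences.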

Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale.
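For readers unfamiliar with the reported metrics, the sketch below computes UPOS accuracy, UAS (share of tokens with the correct head), and LAS (correct head and correct relation label) from gold and predicted CoNLL-U text. It assumes identical tokenization on both sides and integer HEAD values. The paper's own scoring protocol may differ in details, and the helper names and the three-token toy sentence are invented for illustration.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of (id, upos, head, deprel) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def score(gold, pred):
    """Token-level UPOS accuracy, UAS, and LAS over aligned sentences."""
    total = upos_ok = uas_ok = las_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted tokenization differ")
        for (_, g_pos, g_head, g_rel), (_, p_pos, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            upos_ok += g_pos == p_pos
            uas_ok += g_head == p_head
            las_ok += g_head == p_head and g_rel == p_rel
    return {"UPOS": upos_ok / total, "UAS": uas_ok / total, "LAS": las_ok / total}


def row(i, form, upos, head, rel):
    """Build one 10-column CoNLL-U token line (unused columns set to '_')."""
    return "\t".join([str(i), form, "_", upos, "_", "_", str(head), rel, "_", "_"])


# Toy example (invented): the prediction gets one head and one label wrong.
gold_text = "\n".join([row(1, "Η", "DET", 2, "det"),
                       row(2, "Βουλή", "NOUN", 3, "nsubj"),
                       row(3, "συνεδριάζει", "VERB", 0, "root")]) + "\n"
pred_text = "\n".join([row(1, "Η", "DET", 3, "det"),
                       row(2, "Βουλή", "NOUN", 3, "obj"),
                       row(3, "συνεδριάζει", "VERB", 0, "root")]) + "\n"

scores = score(read_conllu(gold_text), read_conllu(pred_text))
print(scores)
```

In the toy case all three tags match, two of three heads match, and only one token has both the right head and the right label, which shows why LAS is always at most UAS, as in the reported 0.6098 UAS versus 0.5162 LAS.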

Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.
