A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
Quick Take
A new reproducible NLP pipeline for Katharevousa Greek parliamentary text integrates OCR, LLM-assisted annotation, and model comparison. The XLM-R model achieved 0.8893 UPOS accuracy and 0.5162 LAS, outperforming spaCy Greek by 0.0980 LAS. This work provides a reusable syntactic infrastructure for historical texts.
Key Points
- Pipeline includes OCR reconstruction, LLM-assisted annotation, and model-family comparison.
- The frozen, automatically validated reference set contains 1,697 sentences: 1,357 for training and 340 held out for testing.
- XLM-R model achieved 0.8893 UPOS accuracy and 0.5162 LAS.
- spaCy Greek reached 0.4183 LAS, indicating substantial register mismatch.
- The entire pipeline is released open access: code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports.
Article Content
Source: arXiv:2605.22978v1 (announce type: new)
Abstract: Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period.
The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol.
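To make the idea of a fixed split concrete, here is a minimal Python sketch of one way to obtain a train/test partition that does not depend on input order or on a random seed. The abstract does not say how the authors fixed their split; the hash-ranking approach, the `fixed_split` function, and the `sent-0000` style IDs are all assumptions made for illustration. Only the counts (1,697 sentences, 340 held out) come from the source.

```python
import hashlib


def fixed_split(sentence_ids, n_test):
    """Return (train_ids, test_ids) that depend only on the IDs themselves.

    Hypothetical helper: IDs are ranked by their SHA-256 digest, so the
    partition is identical no matter how the input list is ordered.
    """
    def rank_key(sentence_id):
        return hashlib.sha256(sentence_id.encode("utf-8")).hexdigest()

    ranked = sorted(set(sentence_ids), key=rank_key)
    return ranked[n_test:], ranked[:n_test]


# Hypothetical sentence IDs; the real resource may name sentences differently.
ids = [f"sent-{i:04d}" for i in range(1697)]
train_ids, test_ids = fixed_split(ids, n_test=340)
print(len(train_ids), len(test_ids))  # 1357 340
```

Ranking by a content hash rather than shuffling with a seed means the same sentence lands on the same side of the split on any machine, which is the property a frozen benchmark needs.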
Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale.
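For readers unfamiliar with the attachment metrics, the sketch below shows the standard definitions: UAS is the share of tokens whose predicted head matches the gold head, and LAS additionally requires the dependency relation to match. It assumes gold and predicted sentences share the same tokenization. The function name `attachment_scores` and the toy sentence are illustrative assumptions, and the paper's own scoring protocol may treat details such as punctuation differently.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[List[Arc]], pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (UAS, LAS) as token-level fractions over all sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted data must have the same number of sentences")
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must share tokenization")
        for (gold_head, gold_rel), (pred_head, pred_rel) in zip(gold_sent, pred_sent):
            total += 1
            if gold_head == pred_head:
                uas_hits += 1  # correct head counts toward UAS
                if gold_rel == pred_rel:
                    las_hits += 1  # correct head and label counts toward LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# Toy four-token sentence (made-up arcs, not data from the paper).
gold = [[(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]]
pred = [[(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

In the toy example three of four heads are correct (UAS 0.75), but one of those three carries the wrong label, so LAS drops to 0.5. This is why LAS can never exceed UAS, consistent with the reported 0.6098 UAS and 0.5162 LAS.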
Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline -- code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports -- is released as an open-access companion to this paper.