The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
Quick Take
The Annotation Scarcity Paradox names the gap between how quickly low-resource NLP models can be scaled and how scarce the human expertise needed to evaluate them remains.
Key Points
- Low-resource NLP has rapidly evolved over the last decade.
- Evaluation expertise is unevenly distributed and marginalized.
- A shift to community-embedded evaluation is necessary.
Abstract: Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014 to present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses, including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning, and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.
| Comments: | Under Review |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.19066 [cs.CL] |
| (or arXiv:2605.19066v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.19066 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Vukosi Marivate
[v1] Mon, 18 May 2026 19:48:00 UTC (54 KB)
— Originally published at arxiv.org