AI for Monitoring and Classifying Data Used in Research Literature
Quick Take
This paper introduces a multitask GLiNER-based framework for monitoring dataset usage in research literature, addressing the lack of infrastructure for dataset citations. By leveraging synthetic data and LLMs, the methodology enhances reliability and consistency in dataset mention extraction and classification, contributing to improved transparency and reproducibility in research.
Key Points
- Introduces a multitask GLiNER-based framework for dataset monitoring.
- Addresses inconsistent citation practices and scarce labeled data.
- Uses synthetic data generation to create training examples.
- Applies LLM-based revalidation to filter incorrect mentions and enforce labeling consistency.
- Aims to improve transparency and reproducibility in research.
Article Content
From source RSS / original summary
arXiv:2605.30582v1 Announce Type: new
Abstract: While platforms like Google Scholar and Semantic Scholar track citations for academic papers, no comparable infrastructure exists for monitoring dataset usage in research literature, leaving the landscape of data use largely opaque. Addressing this gap is critical for transparency, reproducibility, and monitoring of impact, yet progress is hindered by inconsistent citation practices, scarce labeled data, and ambiguous references to datasets in the wild.
Traditional NLP approaches struggle with these challenges, motivating the shift toward more adaptive, semantically rich models. Building on prior work using LLMs for data mention detection and synthetic data for bootstrapping training, this paper presents an updated methodology for scalable dataset monitoring. We introduce a multitask GLiNER-based framework that jointly performs dataset mention extraction, relation identification, and usage-context classification.
To address label scarcity, the pipeline leverages synthetic data generation to produce training examples and LLM-based revalidation to filter incorrect mentions and enforce labeling consistency, together improving reliability, coverage, and output consistency across the training pipeline. This work advances the development of open-source tools for monitoring data use in research literature, contributing to the broader goal of generalizable, unconstrained dataset citation tracking.
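The revalidation step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Example` record, the `revalidate` function, the usage labels, and the `toy_validator` (a rule-based stand-in for an LLM judge) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Example:
    """One synthetic training example (hypothetical schema)."""
    text: str
    mention: str
    usage_label: str  # hypothetical labels, e.g. "primary", "background"


# A validator returns a (possibly corrected) usage label, or None to reject.
# In a real pipeline this would wrap an LLM call; here it is any callable.
Validator = Callable[[Example], Optional[str]]


def revalidate(examples: Iterable[Example], validator: Validator) -> List[Example]:
    """Drop examples the validator rejects and keep its label for the rest."""
    kept = []
    for ex in examples:
        # Cheap sanity check: the mention must actually occur in the text.
        if ex.mention not in ex.text:
            continue
        verdict = validator(ex)
        if verdict is None:
            continue
        kept.append(Example(ex.text, ex.mention, verdict))
    return kept


def toy_validator(ex: Example) -> Optional[str]:
    """Stand-in for an LLM judge (assumption): rejects generic mentions."""
    if ex.mention.lower() in {"data", "the dataset", "survey data"}:
        return None
    # Normalize label spelling so equivalent labels compare equal.
    return ex.usage_label.strip().lower()


synthetic = [
    Example("We use the Demographic and Health Survey for estimates.",
            "Demographic and Health Survey", "Primary "),
    Example("The data were cleaned before analysis.", "data", "primary"),
    Example("Prior work relied on census microdata.",
            "World Values Survey", "background"),
]

clean = revalidate(synthetic, toy_validator)
for ex in clean:
    print(ex.mention, "->", ex.usage_label)
```

Only the first example survives: the second is rejected as a generic mention, and the third names a dataset that does not appear in its text. Keeping the validator as a plain callable means an actual LLM call could be substituted without changing the filtering logic.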