What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

arXiv cs.CL · Robert Leaman, Rezarta Islamaj, Zhiyong Lu

Quick Take

A framework for evaluating biomedical NER and EL benchmarks based on corpus characteristics.

Key Points

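Derives benchmark-relevant statistics directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings.
Finds that nine corpora spanning diseases, chemicals, and cell types differ substantially, even when they address the same apparent task.
Releases the framework as open-source code with an interactive dashboard for reproducing the analyses and characterizing additional corpora.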

[Submitted on 19 May 2026]

Abstract: Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density, and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
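
To make two of the statistic families above concrete, here is a minimal Python sketch of label distribution and density (family 1) and train-test overlap (family 3). It is an illustration only and is not the authors' released code: the Mention record, the function names, the toy mentions, and the placeholder concept identifiers (C001 and so on) are all assumptions introduced here, and the abstract does not say how the framework itself defines these statistics.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated entity mention (hypothetical record layout)."""
    doc_id: str
    text: str        # surface string as annotated
    label: str       # entity type, e.g. "Disease"
    concept_id: str  # linked identifier; the values below are placeholders


def label_distribution(mentions):
    """Share of mentions carrying each entity label."""
    counts = Counter(m.label for m in mentions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def mentions_per_document(mentions):
    """Annotation density: mentions divided by distinct documents."""
    docs = {m.doc_id for m in mentions}
    return len(mentions) / len(docs) if docs else 0.0


def overlap_rate(train, test, key):
    """Fraction of test mentions whose key(mention) also occurs in train."""
    if not test:
        return 0.0
    seen = {key(m) for m in train}
    return sum(key(m) in seen for m in test) / len(test)


# Toy data, invented for illustration only.
train = [
    Mention("d1", "breast cancer", "Disease", "C001"),
    Mention("d1", "aspirin", "Chemical", "C002"),
    Mention("d2", "Breast Cancer", "Disease", "C001"),
]
test = [
    Mention("d3", "breast cancer", "Disease", "C001"),
    Mention("d3", "mammary carcinoma", "Disease", "C001"),
    Mention("d4", "ibuprofen", "Chemical", "C003"),
    Mention("d4", "Aspirin", "Chemical", "C002"),
]

# Lexical overlap compares lowercased surface strings;
# conceptual overlap compares linked identifiers.
surface_overlap = overlap_rate(train, test, key=lambda m: m.text.lower())
concept_overlap = overlap_rate(train, test, key=lambda m: m.concept_id)

print(label_distribution(train))
print(mentions_per_document(train))
print(surface_overlap, concept_overlap)
```

In this toy split the concept-level overlap is higher than the surface-level overlap, because "mammary carcinoma" is a new string for a concept already seen in training. A gap of this kind is one way the degree of train-test reuse and the generalization demands of a benchmark can diverge.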

Comments:	Accepted to the ACL 25th Workshop on Biomedical Language Processing
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.20537 [cs.CL]
	(or arXiv:2605.20537v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.20537 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Robert Leaman
[v1] Tue, 19 May 2026 22:19:22 UTC (373 KB)

— Originally published at arxiv.org
