MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

arXiv cs.CL · Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

5/21/2026

Quick Take

MedicalBench is a benchmark for extracting implicitly expressed medical concepts from electronic health records, with each extracted concept grounded in sentence-level evidence.

Key Points

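Targets concepts that are implied in the medical note rather than explicitly stated.
Builds its dataset from MIMIC-IV discharge summaries and human-verified ICD-10 codes, with medical annotation and expert review.
Finds that state-of-the-art LLMs reach only modest performance, largely independent of note length.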

[Submitted on 5 Apr 2026]

Abstract: Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
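
The two evaluation tasks named in the abstract can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `Example` and `Prediction` record layouts, the choice of precision/recall/F1, and the micro-averaging over gold-positive pairs are hypothetical, since the abstract does not state the released data format or the exact metrics. The toy note is invented and does not come from MIMIC-IV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One note-concept pair (hypothetical schema, not the released format)."""
    note_sentences: tuple[str, ...]  # note split into sentences
    concept: str                     # candidate concept, e.g. an ICD-10 description
    label: bool                      # gold: is the concept supported by the note?
    evidence: frozenset[int]         # gold: indices of supporting sentences


@dataclass(frozen=True)
class Prediction:
    label: bool                      # model's verification decision
    evidence: frozenset[int]         # sentence indices the model cites


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, F1 from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def score_verification(gold: list[Example], pred: list[Prediction]):
    """Task 1 (assumed metric): binary verification of each note-concept pair."""
    tp = sum(g.label and p.label for g, p in zip(gold, pred))
    fp = sum((not g.label) and p.label for g, p in zip(gold, pred))
    fn = sum(g.label and (not p.label) for g, p in zip(gold, pred))
    return prf(tp, fp, fn)


def score_evidence(gold: list[Example], pred: list[Prediction]):
    """Task 2 (assumed metric): sentence-level evidence overlap,
    micro-averaged over gold-positive pairs only."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if not g.label:
            continue  # negatives carry no gold evidence in this sketch
        tp += len(g.evidence & p.evidence)
        fp += len(p.evidence - g.evidence)
        fn += len(g.evidence - p.evidence)
    return prf(tp, fp, fn)


# Invented toy note: the diagnosis is implied by treatment and lab value.
note = (
    "Patient was started on metformin 500 mg twice daily.",
    "HbA1c was 8.2%.",
)
gold = [
    Example(note, "Type 2 diabetes mellitus", True, frozenset({0, 1})),   # implicit positive
    Example(note, "Type 1 diabetes mellitus", False, frozenset()),        # confusable negative
]
pred = [
    Prediction(True, frozenset({0})),   # correct label, partial evidence
    Prediction(True, frozenset({1})),   # false positive on the confusable concept
]

verification_scores = score_verification(gold, pred)
evidence_scores = score_evidence(gold, pred)
```

In this toy run the model is penalized on verification for accepting the confusable negative, and on evidence retrieval for citing only one of the two supporting sentences, which mirrors the abstract's split between correctness and interpretability.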

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.20197 [cs.CL]
	(or arXiv:2605.20197v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.20197 arXiv-issued DOI via DataCite

Submission history

From: Sanjit Singh Batra
[v1] Sun, 5 Apr 2026 14:11:00 UTC (716 KB)

Originally published at arxiv.org
