MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
Quick Take
MedicalBench introduces a benchmark for evaluating implicit medical concept extraction in electronic health records.
Key Points
- Focuses on implicit medical reasoning over explicit concepts.
- Includes a dataset from MIMIC-IV with expert-reviewed annotations.
- Highlights modest performance of state-of-the-art LLMs.
Abstract: Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.
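The two evaluation tasks in the abstract (verifying a note-concept pair, then pointing to the supporting sentences) can be illustrated with a small scoring sketch. Everything here is an assumption made for illustration: the `Example` and `Prediction` schemas, the choice of F1 as the metric, the decision to score evidence only on gold-positive pairs, and the toy note are not taken from the paper, which may define its data format and metrics differently.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One note-concept pair (hypothetical schema, not the released format)."""
    sentences: list[str]   # the note, already split into sentences
    concept: str           # candidate concept, e.g. an ICD-10 code description
    label: bool            # gold answer: does the note support the concept?
    evidence: set[int] = field(default_factory=set)  # gold evidence sentence indices


@dataclass
class Prediction:
    """A model's answer for one pair (hypothetical schema)."""
    label: bool
    evidence: set[int] = field(default_factory=set)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when nothing is predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def verification_f1(examples: list[Example], predictions: list[Prediction]) -> float:
    """Task 1 (assumed metric): F1 on the positive class of the yes/no decision."""
    pairs = list(zip(examples, predictions))
    tp = sum(e.label and p.label for e, p in pairs)
    fp = sum((not e.label) and p.label for e, p in pairs)
    fn = sum(e.label and (not p.label) for e, p in pairs)
    return f1(tp, fp, fn)


def evidence_f1(examples: list[Example], predictions: list[Prediction]) -> float:
    """Task 2 (assumed metric): micro-averaged sentence-level F1.

    Only gold-positive pairs are scored, since a negative pair has no
    gold evidence sentences to retrieve.
    """
    tp = fp = fn = 0
    for e, p in zip(examples, predictions):
        if not e.label:
            continue
        tp += len(e.evidence & p.evidence)
        fp += len(p.evidence - e.evidence)
        fn += len(e.evidence - p.evidence)
    return f1(tp, fp, fn)


# Toy data invented for illustration; it is not drawn from MIMIC-IV.
note = ["Continued on metformin.", "HbA1c was 8.2%.", "Denies chest pain."]
examples = [
    Example(note, "Type 2 diabetes mellitus", True, {0, 1}),   # implicit positive
    Example(note, "Type 1 diabetes mellitus", False),          # confusable negative
]
predictions = [Prediction(True, {0}), Prediction(True, {1})]

print(round(verification_f1(examples, predictions), 3))
print(round(evidence_f1(examples, predictions), 3))
```

Scoring the two tasks separately mirrors the abstract's distinction between correctness and interpretability: a model can answer the verification question correctly while citing the wrong sentences, and only the second score would reveal that.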
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.20197 [cs.CL] (or arXiv:2605.20197v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.20197 (arXiv-issued DOI via DataCite)
Submission history
From: Sanjit Singh Batra
[v1] Sun, 5 Apr 2026 14:11:00 UTC (716 KB)
Originally published at arxiv.org