Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records
Quick Take
The study introduces EHR-ReasonCon, a reasoning-intensive benchmark for verifying consistency between clinical notes and structured tables in EHRs, comprising 8,048 entities derived from clinical notes. EHR-Inspector, an LLM-based framework, achieves state-of-the-art performance on this task across multiple model backbones, and the analyses highlight how its behavior differs from human verification.
Key Points
- EHR-ReasonCon is built on MIMIC-III and includes 8,048 entities derived from clinical notes, with expert-guided annotations.
- EHR-Inspector uses LLMs to segment notes, extract anchor entities and temporal references, and verify consistency against structured tables.
- The framework achieves state-of-the-art performance under both harsh and lenient evaluation criteria, across multiple model backbones.
- Specialized table-exploration tools support the annotation protocol, enabling systematic evidence retrieval and reliable consistency assessment.
- Analyses demonstrate the effectiveness of the framework's components and highlight differences from human verification.
Article Content
From source RSS / original summary (arXiv:2605.26463v1). Abstract: Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes.
To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment.
We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.
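The segment, extract, and verify flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the LLM steps are replaced by regular expressions, and every name here (`LAB_TABLE`, `Claim`, `segment`, `extract_claims`, `lookup`, `verify`, `inspect_note`), the table layout, the tolerance, and the three output labels are hypothetical choices made only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical structured table, loosely shaped like a lab-events table.
LAB_TABLE: List[Dict] = [
    {"item": "potassium", "day": 1, "value": 3.1},
    {"item": "potassium", "day": 2, "value": 4.0},
    {"item": "creatinine", "day": 1, "value": 1.2},
]

@dataclass
class Claim:
    entity: str   # anchor entity, e.g. a lab name
    day: int      # temporal reference (hospital day)
    value: float  # value stated in the note

def segment(note: str) -> List[str]:
    """Split a note into sentence-like segments (regex stand-in for an LLM)."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", note) if s.strip()]

# Assumed note phrasing: "<entity> was <value> on day <n>".
CLAIM_RE = re.compile(
    r"(?P<entity>[a-z]+) (?:was|of) (?P<value>\d+(?:\.\d+)?) on day (?P<day>\d+)",
    re.IGNORECASE,
)

def extract_claims(text: str) -> List[Claim]:
    """Extract anchor entity, value, and temporal reference from one segment."""
    return [
        Claim(m["entity"].lower(), int(m["day"]), float(m["value"]))
        for m in CLAIM_RE.finditer(text)
    ]

def lookup(table: List[Dict], entity: str, day: int) -> List[Dict]:
    """Toy table-exploration tool: rows matching the entity and day."""
    return [r for r in table if r["item"] == entity and r["day"] == day]

def verify(claim: Claim, table: List[Dict], tol: float = 0.05) -> str:
    """Label a claim against the table; the label names are hypothetical."""
    rows = lookup(table, claim.entity, claim.day)
    if not rows:
        return "no_evidence"
    matched = any(abs(r["value"] - claim.value) <= tol for r in rows)
    return "consistent" if matched else "inconsistent"

def inspect_note(note: str, table: List[Dict]) -> List[Tuple[Claim, str]]:
    """Run segment -> extract -> verify over a whole note."""
    return [(c, verify(c, table)) for s in segment(note) for c in extract_claims(s)]

if __name__ == "__main__":
    note = "Potassium was 3.1 on day 1. Potassium was 4.5 on day 2. Sodium was 140 on day 1."
    for claim, label in inspect_note(note, LAB_TABLE):
        print(claim, label)
```

Exact value matching like this is the surface-level approach the abstract argues is insufficient. The sketch only shows where the pipeline stages sit, not the reasoning over clinical interpretation, event relations, or temporal changes that the benchmark targets.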