PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage
Quick Answer
PSEBench introduces a 5,074-case benchmark for evaluating LLMs in patient safety event triage, utilizing a structured clause card methodology.
Quick Take
Built on Minnesota's 29 Reportable Adverse Health Events, PSEBench comes with an agentic evaluation environment. An evaluation of 15 LLMs shows consistent capability trends and identifies actionable gaps that remain before LLM-based triage can be considered reliable.
Key Points
- PSEBench is built on Minnesota's 29 Reportable Adverse Health Events.
- The construction pipeline can also generate missing-information and uncertain (ambiguous) variants of a case.
- 15 LLMs were evaluated, revealing trends and actionable gaps in performance.
- The methodology combines clause cards with anchor-driven instantiation.
- PSEBench includes an agentic evaluation environment and targets reliable LLM-based patient safety event triage.
Article Content
From source RSS / original summary (arXiv:2606.05463v1)
Abstract: Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases.
We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants.
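To make the clause card idea concrete, here is a minimal Python sketch. It is not the paper's schema: the summary does not give the card format, so `ClauseCard`, `Decision`, `triage`, `drop_fact`, and the condition names are all hypothetical. It only illustrates how a clause reduced to named conditions could give a case its label by construction, and how withholding one fact could yield a missing-information variant.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REPORTABLE = "reportable"
    NOT_REPORTABLE = "not_reportable"
    NEEDS_INFO = "needs_info"


@dataclass(frozen=True)
class ClauseCard:
    """Hypothetical clause card: one policy clause as named yes/no conditions."""
    clause_id: str
    required: tuple[str, ...]  # every condition must hold for the clause to apply


def triage(card: ClauseCard, facts: dict[str, bool | None]) -> Decision:
    """Apply a card to case facts; None or a missing key means 'not stated'."""
    values = [facts.get(name) for name in card.required]
    if any(v is False for v in values):
        return Decision.NOT_REPORTABLE  # one failed condition rules the clause out
    if any(v is None for v in values):
        return Decision.NEEDS_INFO      # nothing failed, but something is unknown
    return Decision.REPORTABLE


def drop_fact(facts: dict[str, bool | None], name: str) -> dict[str, bool | None]:
    """Derive a missing-information variant by withholding one fact."""
    return {k: (None if k == name else v) for k, v in facts.items()}


# Illustrative card and case; the condition names are invented for this sketch.
card = ClauseCard("example-clause", ("occurred_in_facility", "serious_injury"))
complete_case = {"occurred_in_facility": True, "serious_injury": True}
incomplete_case = drop_fact(complete_case, "serious_injury")
```

In this sketch, a verification step would amount to checking that `triage(card, complete_case)` matches the label the case was generated to have, while `incomplete_case` should prompt a request for more information rather than a verdict.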
We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
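The summary does not state which metrics PSEBench reports. As a rough illustration only, the sketch below scores predictions separately per case type, so that skill at deciding, asking for missing information, and abstaining are not blended into one number. The function name, case-type labels, and action labels are assumptions made for this example.

```python
from collections import defaultdict


def score_by_case_type(records):
    """Per-case-type accuracy.

    records: iterable of (case_type, gold_action, predicted_action) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for case_type, gold, predicted in records:
        totals[case_type] += 1
        correct[case_type] += int(gold == predicted)
    return {case_type: correct[case_type] / totals[case_type] for case_type in totals}


# Hypothetical records: complete cases expect a verdict, missing-information
# cases expect a question, and uncertain cases expect abstention.
records = [
    ("complete", "reportable", "reportable"),
    ("complete", "not_reportable", "reportable"),
    ("missing_info", "ask", "ask"),
    ("uncertain", "abstain", "reportable"),
]
scores = score_by_case_type(records)
```

Reporting the three groups side by side makes it visible when a model performs well on complete reports but answers confidently where it should ask or abstain.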