PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

arXiv cs.AI·Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan Xue, Zhijun Yin

·~1 min·6/6/2026·en·1

Quick Answer

PSEBench is a 5,074-case benchmark for evaluating LLMs on patient safety event triage, built with a policy-grounded methodology centered on structured clause cards.

Quick Take

PSEBench pairs its 5,074 cases with an agentic evaluation environment and by-construction ground truth. An evaluation of 15 LLMs shows consistent capability trends and identifies actionable gaps toward reliable LLM-based triage.

Key Points

PSEBench is built on Minnesota's 29 Reportable Adverse Health Events.
The construction pipeline can generate missing-information and uncertain variants of cases, testing information seeking and abstention.
15 LLMs were evaluated, revealing trends and actionable gaps in performance.
The methodology combines clause cards with anchor-driven instantiation.
PSEBench aims to enable reliable evaluation of LLM-based patient safety event triage.

Article Content

From source RSS / original summary

arXiv:2606.05463v1

Abstract: Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases.

We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants.
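
To make the clause card idea concrete, here is a minimal sketch in Python. The abstract does not give the actual schema, so everything in it is an assumption made for illustration: the `ClauseCard` fields, the `"yes"`/`"no"`/`"ambiguous"` fact values, the example clause and its element names, and the `triage` and `drop_element` helpers. It only shows how a clause split into named elements could yield a label by construction, and how removing one element could yield a missing-information variant with a known expected action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Decision(Enum):
    REPORTABLE = "reportable"
    NOT_REPORTABLE = "not_reportable"
    ASK = "ask_for_missing_information"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class ClauseCard:
    """Hypothetical clause card: one policy clause split into named elements."""
    clause_id: str
    required_elements: Tuple[str, ...]


def triage(card: ClauseCard, facts: Dict[str, str]) -> Decision:
    """Derive the label from the facts a narrative states.

    Assumed fact values: "yes", "no", or "ambiguous". An element that is
    absent from `facts` is treated as not stated in the report.
    """
    values = [facts.get(name) for name in card.required_elements]
    if "no" in values:
        # One failed element is enough to rule the clause out.
        return Decision.NOT_REPORTABLE
    if None in values:
        # An unstated element could be resolved by asking for it.
        return Decision.ASK
    if "ambiguous" in values:
        # Stated, but cannot be resolved either way.
        return Decision.ABSTAIN
    return Decision.REPORTABLE


def drop_element(facts: Dict[str, str], name: str) -> Dict[str, str]:
    """Build a missing-information variant by removing one stated element."""
    return {key: value for key, value in facts.items() if key != name}


# Illustrative clause and facts, not taken from the Minnesota policy text.
CARD = ClauseCard("example-clause", ("object_retained", "after_procedure"))
COMPLETE = {"object_retained": "yes", "after_procedure": "yes"}

if __name__ == "__main__":
    print(triage(CARD, COMPLETE))
    print(triage(CARD, drop_element(COMPLETE, "after_procedure")))
```

Because each variant is derived from the card rather than labeled after the fact, its expected decision is known in advance, which is the sense of "by-construction ground truth" in this sketch.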

We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
