When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

arXiv cs.CL·Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

5/22/2026

Quick Answer

OGCaReBench evaluates LLMs on clinical questions that fall outside standard guidelines. The best-performing baseline, GPT-5.2, answers 56% of them correctly, and adding retrieved medical articles raises this to as much as 82%.

Quick Take

On OGCaReBench, a benchmark of clinical questions that go beyond standard guidelines, GPT-5.2 answers only 56% correctly on its own and up to 82% when given retrieved medical articles. The gap highlights the need for evidence-grounded reasoning in rare medical scenarios.

Key Points

OGCaReBench focuses on free-form clinical question answering beyond standard guidelines.
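GPT-5.2, the best-performing baseline, answers 56% of the benchmark correctly; specialized medical models reach only 42%.
Augmenting models with retrieved medical articles raises accuracy to as much as 82% (with GPT-5.2).
The benchmark is extracted from published medical case reports and validated by medical experts.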
This work aims to enhance reliability in challenging clinical contexts.

[Submitted on 20 May 2026]

Authors: Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

Abstract: Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating LLMs on clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark, with specialized models reaching only 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2), highlighting the importance of evidence grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.
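To put the reported figures in perspective, the short sketch below works through the arithmetic of going from 56% to 82% accuracy. It is only an illustration: the function name `accuracy_gain` and its return keys are hypothetical, and since the abstract reports 82% as an "up to" figure, the results should be read as upper bounds rather than as numbers taken from the paper.

```python
def accuracy_gain(baseline_pct: int, augmented_pct: int) -> dict:
    """Compare two accuracies given as whole-number percentages.

    Hypothetical helper for illustration; not part of the benchmark.
    """
    if not (0 < baseline_pct < 100 and 0 <= augmented_pct <= 100):
        raise ValueError("baseline must be in (0, 100), augmented in [0, 100]")

    gain = augmented_pct - baseline_pct
    return {
        # Absolute improvement, in percentage points.
        "absolute_points": gain,
        # Improvement relative to the baseline accuracy.
        "relative_pct": round(gain / baseline_pct * 100, 1),
        # Share of the baseline's wrong answers that are now correct.
        "error_reduction_pct": round(gain / (100 - baseline_pct) * 100, 1),
    }


# Figures reported in the abstract for GPT-5.2.
print(accuracy_gain(56, 82))
# {'absolute_points': 26, 'relative_pct': 46.4, 'error_reduction_pct': 59.1}
```

Read this way, retrieval adds up to 26 percentage points and removes at most about 59% of the errors GPT-5.2 makes without it, which still leaves roughly one question in five unanswered correctly.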

Comments:	34 pages, 20 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.21807 [cs.CL]
	(or arXiv:2605.21807v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.21807 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Doeun Lee
[v1] Wed, 20 May 2026 23:04:48 UTC (1,142 KB)

— Originally published at arxiv.org
