When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
Quick Take
OGCaReBench is a retrieval benchmark that tests LLMs on rare, case-based clinical questions that fall outside published guidelines, and it reveals clear gaps in current models.
Key Points
- Focuses on rare, case-based clinical questions.
- The best model answers only 56% of the benchmark questions correctly.
- Evidence grounding raises performance to 82%.