A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
Quick Take
A multi-domain red teaming framework evaluated eleven LLMs on 690 clinical scenarios, with mean scores ranging from 0.791 to 0.984. The top systems (X-BAI, GPT-5, and Claude Opus 4.1) scored above 0.97, yet several high-performing systems still produced complete failures in individual safety-critical scenarios. The authors conclude that credible safety assessment requires hybrid evaluation combining automated scoring with clinician oversight.
Key Points
- Evaluated 11 LLMs across 690 clinical scenarios in 9 domains.
- Mean scores varied substantially across models, from 0.791 to 0.984.
- Top models X-BAI, GPT-5, and Claude Opus 4.1 scored above 0.97.
- Equity-related tasks showed 10-20% error amplification with demographic changes.
- Human reviewers identified failures missed by automated evaluations.
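The summary does not define how "error amplification" is computed. As a minimal sketch, assuming it means the relative increase in error rate when a scenario's demographic details are modified, the arithmetic could look like the following. The function name, its parameters, and the counts are hypothetical and are not taken from the paper.

```python
def error_amplification(base_errors: int, modified_errors: int, n_scenarios: int) -> float:
    """Relative increase in error rate after a demographic modification.

    Assumed definition (not confirmed by the source):
        (modified_rate - base_rate) / base_rate
    """
    if n_scenarios <= 0:
        raise ValueError("n_scenarios must be positive")
    base_rate = base_errors / n_scenarios
    modified_rate = modified_errors / n_scenarios
    if base_rate == 0:
        raise ValueError("relative amplification is undefined when the base error rate is zero")
    return (modified_rate - base_rate) / base_rate


# Made-up counts for illustration only: 20 errors on the original scenarios,
# 23 errors on the demographically modified versions, 100 scenarios each.
amplification = error_amplification(base_errors=20, modified_errors=23, n_scenarios=100)
print(f"relative error amplification: {amplification:.0%}")
```

Under this reading, going from 20 to 23 errors per 100 scenarios is roughly a 15% amplification. If the paper instead reports an absolute change in percentage points, the same counts would correspond to 3 points, so the definition matters when comparing figures.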
Article Content
From source RSS / original summary: arXiv:2606.00027v1
Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories.
Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation.
Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.
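To illustrate why a mean score can hide a complete failure, here is a minimal sketch that reports the mean alongside spread, worst case, and failure count. The model names, the scores, and the `failure_threshold` of 0.5 are all made up for illustration. They are not data or settings from the paper, whose seven-dimension rubric is not reproduced here.

```python
from statistics import mean, pstdev


def summarize(scores: dict[str, list[float]], failure_threshold: float = 0.5) -> dict[str, dict[str, float]]:
    """Per-model mean, spread, worst-case score, and count of scenarios below the threshold."""
    summary = {}
    for model, values in scores.items():
        summary[model] = {
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation across scenarios
            "worst": min(values),   # worst single-scenario score
            "failures": sum(v < failure_threshold for v in values),
        }
    return summary


# Hypothetical per-scenario scores in [0, 1]; both models have the same mean.
toy_scores = {
    "model_a": [1.0] * 9 + [0.0],  # perfect on nine scenarios, complete failure on one
    "model_b": [0.9] * 10,         # uniformly good, never fails outright
}

report = summarize(toy_scores)
for model, stats in report.items():
    print(model, stats)
```

In this toy case both models average 0.9, but only `model_a` has a worst-case score of 0.0 and a nonzero failure count. Ranking by mean alone would treat the two as equivalent, which is the kind of masking the abstract describes.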