Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
Quick Answer
The study introduces AI-MASLD, a stress-testing framework for clinical LLMs, revealing that while models perform well under clean conditions, they exhibit significant performance divergence under realistic stress, highlighting safety issues overlooked by traditional benchmarks.
Quick Take
Notably, an open-weight model outperformed proprietary models across safety metrics, underscoring the need for narrative stress auditing in LLM evaluations.
Key Points
- AI-MASLD stress-tests seven clinical LLMs using 240 cases and three performance indices.
- Models showed uniform performance under clean conditions but diverged sharply under narrative stress.
- Quantized models displayed pseudonormalization, masking functional collapse with low flip rates.
- Medical fine-tuning degraded logical stability and fairness in LLMs.
- An open-weight model consistently outperformed proprietary counterparts on safety dimensions.
Article Excerpt
arXiv:2606.07929v1. Abstract: Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs.
Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes.
Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.
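The abstract names a perturbation flip rate (PFR) but this summary does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of cases whose answer changes between the clean and the perturbed version of the same case. The function names, the answer labels, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def perturbation_flip_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of cases whose answer differs between clean and perturbed runs.

    Assumed definition for illustration; the paper's PFR may differ.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("baseline and perturbed must have the same length")
    if not baseline:
        raise ValueError("at least one case is required")
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)


def accuracy(answers: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of answers that match the reference labels."""
    if len(answers) != len(gold):
        raise ValueError("answers and gold must have the same length")
    if not answers:
        raise ValueError("at least one case is required")
    return sum(a == g for a, g in zip(answers, gold)) / len(answers)


# Hypothetical toy data: four cases, one answer changes under perturbation.
gold = ["A", "B", "C", "D"]
clean_answers = ["A", "B", "C", "D"]
stressed_answers = ["A", "B", "C", "A"]

pfr = perturbation_flip_rate(clean_answers, stressed_answers)
acc_clean = accuracy(clean_answers, gold)
acc_stressed = accuracy(stressed_answers, gold)
```

A flip rate on its own only measures whether answers change, not whether they are correct, so under this reading it makes sense to report it next to an accuracy-style measure rather than in place of one.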