Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

arXiv cs.AI·Yuan Shen, Xiaojun Wu, Linghua Yu

6/9/2026

Quick Answer

The study introduces AI-MASLD, a stress-testing framework for clinical LLMs, revealing that while models perform well under clean conditions, they exhibit significant performance divergence under realistic stress, highlighting safety issues overlooked by traditional benchmarks.

Quick Take

Notably, an open-weight model outperformed proprietary models across safety metrics, underscoring the need for narrative stress auditing in evaluations.

Key Points

AI-MASLD stress-tests seven clinical LLMs using 240 cases and three performance indices.
Models showed uniform performance under clean conditions but diverged sharply under narrative stress.
Quantized models displayed pseudonormalization, masking functional collapse with low flip rates.
Medical fine-tuning degraded logical stability and fairness in LLMs.
An open-weight model consistently outperformed proprietary counterparts on safety dimensions.

Source Excerpt

arXiv:2606.07929v1 Announce Type: new Abstract: Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs.

Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. …
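The excerpt names the perturbation flip rate (PFR) but is truncated before defining it. As a rough illustration only, the sketch below assumes a flip rate is the share of cases whose answer changes between the clean and perturbed prompt. That definition, the CaseResult type, and the toy answers are all hypothetical and not taken from the paper. The sketch shows why a low flip rate on its own cannot be read as safety: a model that gives the same answer to everything never flips, yet is mostly wrong.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    gold: str      # reference answer (hypothetical labels)
    clean: str     # model answer on the unperturbed case
    stressed: str  # model answer on the perturbed case


def flip_rate(results):
    """Share of cases whose answer changes between clean and stressed runs."""
    return sum(r.clean != r.stressed for r in results) / len(results)


def accuracy(results, condition):
    """Share of cases answered correctly under 'clean' or 'stressed'."""
    return sum(getattr(r, condition) == r.gold for r in results) / len(results)


gold = ["A", "B", "C", "D"]

# Toy model 1: always answers "A", so it never flips but is mostly wrong.
stuck = [CaseResult(g, "A", "A") for g in gold]

# Toy model 2: correct when clean, changes one answer under stress.
responsive = [CaseResult(g, g, "A" if g == "D" else g) for g in gold]

for name, results in [("stuck", stuck), ("responsive", responsive)]:
    print(name, flip_rate(results), accuracy(results, "stressed"))
```

In this toy setup the "stuck" model has a flip rate of 0 with stressed accuracy of 0.25, while the "responsive" model has a flip rate of 0.25 with stressed accuracy of 0.75, which is why a flip rate is more informative when read alongside a correctness measure.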
