Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Quick Take
This study evaluates the semantic stability of clinical LLMs and finds that robustness to meaning-preserving prompt rewording is mixed and highly model-dependent: domain-specific models are neither consistently more nor consistently less robust than general-purpose ones. It introduces an NLI-based semantic verification framework and three sensitivity metrics, and applies them to 16 open-source models.
Key Points
- Proposed a semantic verification framework using Natural Language Inference (NLI).
- Evaluated 16 open-source LLMs, revealing mixed robustness across models.
- Introduced three metrics: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI).
- Domain specialization does not consistently enhance robustness against prompt variations.
- Some domain-specific models outperformed general-purpose models in robustness.
Article Content
From the source RSS summary: arXiv:2605.30646v1, Announce Type: new.
Abstract: Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions.
However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert.
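The NLI filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a variation counts as meaning-preserving only when entailment holds in both directions, and the `NliScorer` signature, the 0.9 threshold, and the `toy_nli` stand-in are all hypothetical. The LLM-as-a-judge and expert audit stages are not modeled.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (premise, hypothesis) -> probability per NLI label.
NliScorer = Callable[[str, str], Dict[str, float]]


def is_meaning_preserving(original: str, variant: str, nli: NliScorer,
                          threshold: float = 0.9) -> bool:
    """Assumed criterion: each text must entail the other."""
    forward = nli(original, variant)["entailment"]
    backward = nli(variant, original)["entailment"]
    return min(forward, backward) >= threshold


def filter_variants(original: str, variants: List[str], nli: NliScorer,
                    threshold: float = 0.9) -> List[str]:
    """Keep only the variants that pass the bidirectional entailment check."""
    return [v for v in variants
            if is_meaning_preserving(original, v, nli, threshold)]


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Toy stand-in for a real NLI model: it only detects a negation mismatch."""
    cues = {"no", "not", "denies", "without"}

    def negated(text: str) -> bool:
        return any(word in cues for word in text.lower().split())

    if negated(premise) != negated(hypothesis):
        return {"entailment": 0.02, "neutral": 0.08, "contradiction": 0.90}
    return {"entailment": 0.95, "neutral": 0.04, "contradiction": 0.01}


original = "The patient reports chest pain for two days."
variants = [
    "For two days, the patient has had chest pain.",
    "The patient denies chest pain for two days.",
]
kept = filter_variants(original, variants, toy_nli)
print(kept)  # only the first variant survives; the negated one is dropped
```

In practice `toy_nli` would be replaced by a trained NLI model. The point of the sketch is the one the abstract makes: a rewording that flips a negation may look similar to the original under an embedding metric, yet it fails an entailment check.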
In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models and their GP counterparts are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.
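The abstract names MVS, ΔC, and WCI but does not give their formulas, so the sketch below does not implement them. It shows three simple stand-in quantities in the same spirit (how often the answer changes, how much the confidence moves, and whether any rewording changes the answer). Every function name and definition here is an assumption made for illustration.

```python
from typing import List, Tuple

# (predicted answer, model confidence in [0, 1]); a hypothetical record format.
Prediction = Tuple[str, float]


def flip_rate(original: Prediction, variants: List[Prediction]) -> float:
    """Share of meaning-preserving variants whose answer differs from the original."""
    if not variants:
        return 0.0
    flips = sum(answer != original[0] for answer, _ in variants)
    return flips / len(variants)


def mean_confidence_shift(original: Prediction,
                          variants: List[Prediction]) -> float:
    """Mean absolute change in confidence relative to the original prompt."""
    if not variants:
        return 0.0
    total = sum(abs(conf - original[1]) for _, conf in variants)
    return total / len(variants)


def any_flip(original: Prediction, variants: List[Prediction]) -> bool:
    """Worst-case view: True if at least one variant changes the answer."""
    return any(answer != original[0] for answer, _ in variants)


# One question, answered once for the original prompt and once per rewording.
original = ("B", 0.80)
variants = [("B", 0.75), ("C", 0.60), ("B", 0.85)]

print(flip_rate(original, variants))              # one of three answers changed
print(mean_confidence_shift(original, variants))  # average confidence movement
print(any_flip(original, variants))               # True
```

Averaging such per-question values over a dataset would give one number per model, which is the kind of comparison the abstract reports between DS and GP models. The paper's actual definitions may differ.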