Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination
Quick Answer
Hallucination in medical LLMs can be detected from internal activations (probe AUROC of 0.77 to 0.86), but the neurons that carry the signal do not give reliable control over it.
Quick Take
A simple, carefully conditioned probe on internal activations detects hallucination in medical LLMs with AUROC scores between 0.77 and 0.86. The signal is readable, yet it does not support reliable neuron-level control, which points to a gap between detecting hallucination and correcting it.
Key Points
- Four open-source models were tested on medical question-answering datasets.
- A conditioned probe reliably detects hallucinations with AUROC scores between 0.77 and 0.86.
- Systematically selected neurons outperform random ones only at very small subset sizes; random subsets of a few hundred neurons recover nearly the full signal.
- Across 16 model-dataset combinations, detectability did not translate into reliable neuron-level control.
- The findings suggest that mitigating hallucination takes more than identifying the right neurons.
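To make the probing and subset comparison concrete, here is a minimal sketch on synthetic data. It is not the paper's code or data: the activations are random numbers in which every neuron carries a small, equal share of the signal (an assumption chosen to mimic a distributed signal), the probe is a plain difference-of-means direction rather than the paper's conditioned probe, and all function names and sizes are hypothetical.

```python
import random

def auroc(scores, labels):
    """Chance that a random positive example outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_activations(n_samples, n_neurons, shift, rng):
    """Synthetic stand-in for hidden states: every neuron shifts slightly when label == 1."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the two classes are balanced
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_neurons)])
        y.append(label)
    return X, y

def fit_probe(X, y, neurons):
    """Difference-of-means probe restricted to the given neuron indices."""
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    return {j: sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
            for j in neurons}

def probe_scores(X, weights):
    """Project each activation vector onto the probe direction."""
    return [sum(w * x[j] for j, w in weights.items()) for x in X]

rng = random.Random(0)
N_NEURONS = 256  # hypothetical layer width, far smaller than a real model
X_train, y_train = make_activations(400, N_NEURONS, 0.2, rng)
X_test, y_test = make_activations(400, N_NEURONS, 0.2, rng)

results = {}
for size in (8, 32, N_NEURONS):
    neurons = rng.sample(range(N_NEURONS), size)  # random subset, no selection heuristic
    probe = fit_probe(X_train, y_train, neurons)
    results[size] = auroc(probe_scores(X_test, probe), y_test)
    print(f"{size:>3} random neurons: AUROC = {results[size]:.3f}")
```

In this toy setup the probe is fit on one split and scored on a held-out split, so the AUROC reflects generalization rather than memorization. How quickly a random subset approaches the full-layer score depends entirely on the assumed signal strength, so the numbers printed here say nothing about the models studied in the paper.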
Article Content
From source RSS / original summary
arXiv:2607.00158v1 (Announce Type: new)
Abstract: Hallucination remains one of the central obstacles to deploying medical LLMs. Yet, even when hallucination can be detected, it is still unclear whether the internal representations associated with it can be used for control rather than detection alone. Using four open-source models across a suite of medical question-answering datasets, we show that a simple, carefully conditioned probe can reliably detect hallucination, with AUROC scores between 0.77 and 0.86 in our case. We further show that this signal is distributed and redundant rather than narrowly localized. Systematically selected neurons outperform random neurons only at very small subset sizes, whereas random subsets of a few hundred neurons recover nearly the full signal, and low-dimensional random projections preserve most of the detection performance. Beyond detection, we test whether this representation is causally actionable.
Across 16 model-dataset combinations, our results reveal a sharp gap between decodability and controllability. The same internal structure that makes hallucination easy to detect does not translate into reliable neuron-level control. These findings show that medical hallucination seems to be readily visible in internal activations, but not easily corrected by steering the neurons most associated with it.
More broadly, our results suggest that hallucination mitigation is not simply a matter of identifying the right neurons, and point to a deeper separation between what representations reveal and what they allow us to change.
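One way to picture the gap between decodability and controllability is a toy intervention. The sketch below is not the paper's experiment. It assumes a hypothetical downstream behavior that depends on the summed activation of a redundant layer, ranks neurons by a difference-of-means probe, and uses mean-ablation (replacing a neuron's value with its average) as a simple stand-in for the steering interventions, which the abstract does not specify. All names, sizes, and the threshold rule are assumptions for illustration.

```python
import random

def make_activations(n_samples, n_neurons, shift, rng):
    """Synthetic hidden states: every neuron shifts slightly when label == 1."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the two classes are balanced
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_neurons)])
        y.append(label)
    return X, y

def toy_output(x, threshold):
    """Hypothetical downstream behavior: fires when the summed activation is high."""
    return int(sum(x) > threshold)

def mean_ablate(x, neurons, neuron_means):
    """Copy x with the chosen neurons replaced by their average training activation."""
    patched = list(x)
    for j in neurons:
        patched[j] = neuron_means[j]
    return patched

def flip_rate(X, neurons, neuron_means, threshold):
    """Fraction of inputs whose toy output changes after the ablation."""
    flips = sum(
        toy_output(x, threshold) != toy_output(mean_ablate(x, neurons, neuron_means), threshold)
        for x in X
    )
    return flips / len(X)

rng = random.Random(0)
N_NEURONS, SHIFT = 256, 0.5  # hypothetical layer width and per-neuron signal
X_train, y_train = make_activations(400, N_NEURONS, SHIFT, rng)
X_test, _ = make_activations(200, N_NEURONS, SHIFT, rng)
THRESHOLD = N_NEURONS * SHIFT / 2  # midpoint between the two classes' expected sums

# Average activation of each neuron, used as the ablation value.
neuron_means = [sum(x[j] for x in X_train) / len(X_train) for j in range(N_NEURONS)]

# Rank neurons by the size of their difference-of-means probe weight.
pos = [x for x, lab in zip(X_train, y_train) if lab == 1]
neg = [x for x, lab in zip(X_train, y_train) if lab == 0]
weights = [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
           for j in range(N_NEURONS)]
ranked = sorted(range(N_NEURONS), key=lambda j: abs(weights[j]), reverse=True)

flip_top8 = flip_rate(X_test, ranked[:8], neuron_means, THRESHOLD)
flip_all = flip_rate(X_test, ranked, neuron_means, THRESHOLD)
print(f"ablate top 8 probe neurons: output flips on {flip_top8:.1%} of inputs")
print(f"ablate all {N_NEURONS} neurons: output flips on {flip_all:.1%} of inputs")
```

In this toy, the neurons the probe ranks highest are easy to find, but removing a handful of them barely moves the output because the remaining neurons carry the same information. Whether redundancy is the reason for the gap the paper reports is an assumption of this sketch, not a claim taken from the abstract.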