Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Quick Take
Web-deployed medical LLMs show substantial hallucination and policy-compliance risks, which calls for multi-metric evaluation and stronger safeguards.
Key Points
- 25-30% of MedGPTs show low factual accuracy.
- 33.6-54.3% of MedGPTs violate operational thresholds.
- HAA-MedGPT dataset released for future safety research.
Abstract: Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
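The abstract mentions evaluating a stratified sample of 1,500 out of 6,233 MedGPTs but does not say which variable defines the strata or how the sample was allocated. As a rough illustration only, the sketch below shows one common approach, proportional allocation with largest-remainder rounding. The `tier` field, the three tier names, and the function name `stratified_sample` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, total, seed=0):
    """Draw exactly `total` items, allocating to each stratum in proportion to its size."""
    rng = random.Random(seed)

    # Group items by stratum.
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    n = len(items)
    sizes = {key: len(members) for key, members in strata.items()}

    # Proportional quota per stratum, using integer arithmetic to avoid rounding drift.
    quota = {key: total * size // n for key, size in sizes.items()}
    remainder = {key: total * size % n for key, size in sizes.items()}

    # Largest-remainder rule: hand out the leftover slots so quotas sum to `total`.
    leftover = total - sum(quota.values())
    for key in sorted(sizes, key=lambda k: remainder[k], reverse=True)[:leftover]:
        quota[key] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for key, members in strata.items():
        sample.extend(rng.sample(members, quota[key]))
    return sample


# Hypothetical population: 6,233 records with an assumed (invented) tier label.
TIERS = ("bottom", "middle", "top")
population = [(i, TIERS[i % 3]) for i in range(6233)]

sample = stratified_sample(population, stratum_of=lambda rec: rec[1], total=1500)
per_tier = {t: sum(1 for rec in sample if rec[1] == t) for t in TIERS}
```

Proportional allocation keeps each stratum's share of the sample close to its share of the population. A study that wants equal precision per stratum might instead fix the same quota for every stratum.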
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as: arXiv:2605.20591 [cs.CL] (or arXiv:2605.20591v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.20591 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Sunday Ogundoyin
[v1] Wed, 20 May 2026 00:57:59 UTC (1,809 KB)
— Originally published at arxiv.org
