Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

arXiv cs.CL·Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

5/21/2026 · ~2 min read

Quick Take

Web-deployed medical LLMs show substantial hallucination and policy-compliance risks, which the authors argue calls for multi-metric evaluation and stronger safeguards.

Key Points

25-30% of MedGPTs show low factual accuracy.
33.6-54.3% of MedGPTs violate operational thresholds.
HAA-MedGPT dataset released for future safety research.

[Submitted on 20 May 2026]

Abstract: Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
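The abstract mentions a stratified sample of 1,500 drawn from 6,233 MedGPTs but does not say how the strata were defined or how the sample was allocated among them. As a rough illustration only, the sketch below shows one common approach: proportional allocation with largest-remainder rounding. The function name, the stratum names, and the stratum sizes are hypothetical placeholders chosen so the sizes sum to 6,233; they are not figures or methods from the paper.

```python
from math import floor


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size.

    Uses largest-remainder rounding so the allocations sum exactly
    to sample_size. This is an illustrative assumption, not the
    authors' documented procedure.
    """
    total = sum(stratum_sizes.values())
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and the population size")

    # Exact (fractional) quota for each stratum.
    exact = {name: size * sample_size / total
             for name, size in stratum_sizes.items()}

    # Start from the rounded-down quotas.
    alloc = {name: floor(quota) for name, quota in exact.items()}

    # Hand the leftover units to the strata with the largest remainders.
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical strata: these sizes are invented for illustration
# and only chosen to sum to the 6,233 MedGPTs named in the abstract.
hypothetical_strata = {"top_tier": 1233, "middle_tier": 2500, "bottom_tier": 2500}
allocation = proportional_allocation(hypothetical_strata, 1500)
print(allocation)
```

Proportional allocation keeps each stratum's share of the sample equal to its share of the population; a study that wanted more precision for small or high-risk strata might oversample them instead, which this sketch does not cover.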

Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2605.20591 [cs.CL]
	(or arXiv:2605.20591v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.20591 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sunday Ogundoyin
[v1] Wed, 20 May 2026 00:57:59 UTC (1,809 KB)

— Originally published at arxiv.org
