Prompting language influences diagnostic reasoning and accuracy of large language models

arXiv cs.CL·Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

Quick Take

Four of five large language models scored lower on diagnostic reasoning and final diagnosis accuracy when prompted in French rather than English; only o3 showed no overall language effect.

Key Points

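The study compares English and French prompting across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B) on 180 clinical vignettes.
Four of the five models performed better in English (mean difference 0.37-0.91 on an 18-point scale, adjusted p < 0.05), with gaps in differential diagnosis, logical structure, and internal validity.
The findings point to prompting language as a critical factor for equitable clinical AI deployment.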

[Submitted on 18 May 2026]

Abstract: Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.
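
The abstract reports a per-model mean difference between English and French scores together with adjusted p-values, but does not say here which statistical test or multiple-comparison correction was used. As a rough illustration only, the sketch below shows one common way such numbers could be produced: a paired mean difference per vignette, and a Holm-Bonferroni adjustment across models. The function names, the choice of Holm's method, and all numbers are assumptions made for this example; none of them come from the paper.

```python
from statistics import mean


def mean_paired_difference(english_scores, french_scores):
    """Mean of (English - French) over vignettes scored in both languages.

    Assumes both lists are aligned: index i is the same vignette.
    """
    if len(english_scores) != len(french_scores):
        raise ValueError("score lists must be aligned per vignette")
    return mean(e - f for e, f in zip(english_scores, french_scores))


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Holm's method is an assumption here; the abstract only says 'adjusted'.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank k, 0-based) is multiplied by (m - k).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up scores on an 18-point scale for four vignettes (illustration only).
english = [15, 12, 17, 10]
french = [14, 12, 15, 9]
difference = mean_paired_difference(english, french)

# Made-up raw p-values for three hypothetical models.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A positive difference means higher scores in English. With real data, the raw p-values would come from a paired test on the per-vignette scores before the adjustment step.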

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.19173 [cs.CL]
	(or arXiv:2605.19173v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19173 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Adrien Bazoge
[v1] Mon, 18 May 2026 22:55:21 UTC (2,139 KB)

— Originally published at arxiv.org
