Prompting language influences diagnostic reasoning and accuracy of large language models
Quick Take
Prompting language significantly affects diagnostic reasoning and accuracy in large language models.
Key Points
- Study compares English and French performance of five LLMs.
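- Four of the five models scored higher in English (mean difference 0.37-0.91 on an 18-point scale) across multiple aspects of reasoning; o3 showed no overall language effect.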
- Findings highlight the importance of language in clinical AI deployment.
Abstract: Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.
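The abstract reports a per-model mean difference between English and French scores together with adjusted p-values, but it does not say which statistical test or multiplicity correction was used. As a rough illustration only, the sketch below computes a mean paired difference over vignettes and applies a Holm step-down adjustment. The Holm method, the function names, and all numbers are assumptions made for this example and are not taken from the study.

```python
from statistics import mean


def mean_paired_difference(english, french):
    """Mean of per-vignette (English - French) score differences."""
    if len(english) != len(french):
        raise ValueError("score lists must be paired one-to-one")
    return mean(e - f for e, f in zip(english, french))


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order.

    Holm is an assumed choice here; the abstract only says "adjusted p".
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up toy scores on an 18-point scale (NOT data from the study).
toy_english = [15, 12, 17, 9, 14]
toy_french = [14, 12, 15, 9, 13]
toy_diff = mean_paired_difference(toy_english, toy_french)

# Made-up raw p-values, one per hypothetical model comparison.
toy_adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A positive `toy_diff` means the English prompts scored higher on average. In a real analysis the raw p-values would come from a paired test chosen to suit the score distribution.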
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.19173 [cs.CL] |
| | (or arXiv:2605.19173v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.19173 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Adrien Bazoge
[v1] Mon, 18 May 2026 22:55:21 UTC (2,139 KB)
— Originally published at arxiv.org