Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

arXiv cs.CV · Jiaqing Zhang, Sandeep Elluri, Bhanu Cherukuvada, Yonah Joffe, Jessica Sena, Miguel Contreras, Scott Siegel, Subhash Nerella, Catherine Price, Parisa Rashidi

Quick Take

Study reveals central tendency bias in multimodal LLMs scoring clinical ordinal scales.

Key Points

LLMs show systematic endpoint compression in scoring.
Zero-shot LLMs remain competitive on within-1 agreement despite higher absolute error.
Calibration-aware evaluation is essential for clinical applications.

[Submitted on 11 May 2026]

Abstract: Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
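
The three quantities the abstract relies on (MAE, within-1 accuracy, and a per-score view of the error) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the toy scores below are hypothetical, and the 0 to 5 range is assumed from the scores mentioned in the abstract. A positive per-score bias at the low end together with a negative bias at the high end is the pattern the abstract calls endpoint compression.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one scale point from the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_bias(y_true, y_pred):
    """Mean signed error (prediction minus reference) for each reference score.

    Positive values mean over-prediction, negative values under-prediction.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Hypothetical toy data on an assumed 0-5 scale (not taken from the paper).
y_true = [0, 0, 1, 2, 3, 3, 4, 5, 5, 5]
y_pred = [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_one_accuracy(y_true, y_pred))
for score, bias in per_score_bias(y_true, y_pred).items():
    print(f"reference score {score}: mean signed error {bias:+.2f}")
```

In this toy example every prediction is within one point of the reference, so within-1 accuracy is perfect, yet the signed error is positive at score 0 and negative at score 5. That is why an aggregate tolerance-based metric alone can hide compression at the extremes.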

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.16386 [cs.CV]
	(or arXiv:2605.16386v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.16386 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jiaqing Zhang
[v1] Mon, 11 May 2026 15:37:24 UTC (1,220 KB)

— Originally published at arxiv.org
