Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
Quick Take
The Visual Aesthetic Benchmark (VAB) exposes a large, measurable gap between MLLM aesthetic judgments and those of human experts.
Key Points
- VAB casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter, rather than scalar scoring of single images.
- The strongest of 20 frontier MLLMs matches expert consensus on only 26.5% of tasks, far below the 68.9% achieved by human experts.
- Fine-tuning a 35B-parameter model on 2,000 expert examples lifts its accuracy to near that of a 397B-parameter open-weight model.
Authors: Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu
Abstract: Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
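To make the scoring protocol concrete, here is a minimal sketch of the permutation-consistency metric the abstract describes, assuming a `model_pick(images)` callable that returns the model's (best, worst) indices for an ordered candidate set. The function name, dictionary keys, and data layout are illustrative assumptions, not the authors' released code:

```python
import random

def vab_task_accuracy(tasks, model_pick, n_perms=3, seed=0):
    """Score a model on VAB-style tasks (illustrative sketch).

    A task counts as correct only if the model identifies BOTH the
    expert-consensus best and worst image under every one of n_perms
    random orderings of the candidate set.

    tasks: list of dicts with keys 'images' (list), 'best', 'worst'
           (hypothetical layout, not the authors' schema).
    model_pick: callable taking an ordered list of images and returning
                (best_index, worst_index) into that ordering.
    """
    rng = random.Random(seed)
    correct = 0
    for task in tasks:
        ok = True
        for _ in range(n_perms):
            order = list(range(len(task["images"])))
            rng.shuffle(order)  # re-shuffle candidate order each pass
            shuffled = [task["images"][i] for i in order]
            b_idx, w_idx = model_pick(shuffled)
            # Map the model's picks back to the original items and
            # compare against the expert-consensus labels.
            if shuffled[b_idx] != task["best"] or shuffled[w_idx] != task["worst"]:
                ok = False
                break
        correct += ok
    return correct / len(tasks)
```

Under this strict conjunction, any sensitivity to candidate order is penalized directly, which is part of why the reported gap between the best model (26.5%) and human experts (68.9%) is so wide.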
Comments: Project page: this https URL. Code: this https URL. Dataset: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
ACM classes: I.2.10; I.4.9; I.2.7
Cite as: arXiv:2605.12684 [cs.CV] (or arXiv:2605.12684v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.12684 (arXiv-issued DOI via DataCite; registration pending)
Submission history
From: Zichen Chen
[v1] Tue, 12 May 2026 19:33:28 UTC (29,129 KB)
— Originally published at arxiv.org