VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv cs.AI·Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

6/4/2026

Quick Take

VAMPS is a benchmark for visual-assisted mathematical problem solving. It contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed synthetic variants. Across a diverse set of models, direct analytical solving outperforms tool-enabled visual solving, which points to a gap in how multimodal models reason over visualizations they construct themselves.

Key Points

VAMPS consists of 1,168 multimodal, bilingual multiple-choice question-answer pairs.
The problems are algebra and calculus questions from Iranian University Entrance Exams, expanded with human-reviewed LLM-generated synthetic variants.
Direct analytical solving outperformed tool-enabled visual solving across diverse models.
The study highlights the importance of visualization tools in engineering and scientific workflows.
VAMPS is designed for both benchmarking and diagnosis: it tests whether a model can benefit from constructing a useful graph and grounding its answer in it.

Article Content

From source RSS / original summary

arXiv:2606.04244v1

Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making.

To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc.

Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
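To make the idea of plotting as a solution strategy concrete, here is a minimal sketch of what a graph reveals numerically: curves are sampled on a grid, as a plotting tool would do, and intersections and extrema are read off the samples. The example problem and the helper names (`sample`, `intersections`, `grid_minimum`) are illustrative assumptions; they are not taken from VAMPS or from the paper's tooling.

```python
def sample(f, lo, hi, n=2001):
    """Evaluate f on n evenly spaced points in [lo, hi], like a plotting grid."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return xs, [f(x) for x in xs]


def intersections(f, g, lo, hi, n=2001):
    """Approximate the x-values where f and g cross, using sign changes of f - g."""
    xs, ds = sample(lambda x: f(x) - g(x), lo, hi, n)
    roots = []
    for i in range(len(xs) - 1):
        d0, d1 = ds[i], ds[i + 1]
        if d0 == 0.0:
            # The curves meet exactly on a grid point.
            roots.append(xs[i])
        elif d0 * d1 < 0:
            # Sign change between neighbours: refine by linear interpolation.
            t = d0 / (d0 - d1)
            roots.append(xs[i] + t * (xs[i + 1] - xs[i]))
    if ds[-1] == 0.0:
        roots.append(xs[-1])
    return roots


def grid_minimum(f, lo, hi, n=2001):
    """Return the sampled point (x, y) with the smallest y on the grid."""
    xs, ys = sample(f, lo, hi, n)
    i = min(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]


# Hypothetical problem (not from the benchmark): where does y = x^2 - 2 meet
# y = x on [-3, 3], and where is the parabola lowest?
def parabola(x):
    return x * x - 2


def line(x):
    return x


crossings = intersections(parabola, line, -3.0, 3.0)
x_min, y_min = grid_minimum(parabola, -3.0, 3.0)
print("intersections near:", [round(x, 3) for x in crossings])
print("minimum near:", (round(x_min, 3), round(y_min, 3)))
```

A grid-based reading like this is only as precise as the sampling resolution, and it can miss tangencies where the curves touch without crossing.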
