PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

arXiv cs.AI·Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi

7/2/2026

Quick Answer

PHREEQC-MCQ-200 is a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations. It shows that simulator access raises aggregate accuracy but can also cause regressions on items that agents answered correctly without tools.

Quick Take

The study argues that scientific agents should be evaluated not only on accuracy but also on item-level retention and output-access sensitivity.

Key Points

The benchmark consists of 200 multiple-choice questions derived from 21 validated PHREEQC scenarios.
Simulator access substantially improves aggregate accuracy across frontier and mid-tier model families.
Tool-augmented agents can regress on items they answered correctly without tools.
The output-access protocol affects performance: a table-of-contents interface can cut token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models.
Evaluations should report accuracy, item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.

Article Content

From source RSS / original summary

arXiv:2607.00436v1. Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations.

The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construct simulator inputs, execute PHREEQC, inspect structured outputs, and commit to final answers. Across multiple frontier and mid-tier model families, simulator access substantially improves aggregate accuracy, confirming that grounded execution is necessary for many scientific-computation tasks.
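The four steps named above (construct input, execute, inspect output, commit to an answer) can be pictured with a minimal sketch. This is not the benchmark's harness: `build_input`, `answer_item`, and `fake_simulator` are hypothetical names, the stub returns a placeholder number rather than a real PHREEQC result, and the answer rule (pick the nearest numeric option) is an assumption made for illustration.

```python
from typing import Callable, Dict


def build_input(ph: float, temp_c: float, totals_mmol: Dict[str, float]) -> str:
    """Assemble a minimal PHREEQC input deck with one SOLUTION block."""
    lines = [
        "SOLUTION 1",
        f"    temp {temp_c}",
        f"    pH {ph}",
        "    units mmol/kgw",
    ]
    for element, conc in totals_mmol.items():
        lines.append(f"    {element} {conc}")
    lines.append("END")
    return "\n".join(lines)


def answer_item(
    deck: str,
    run_simulator: Callable[[str], Dict[str, float]],
    quantity: str,
    options: Dict[str, float],
) -> str:
    """Run the deck, read one quantity, and commit to the nearest option."""
    results = run_simulator(deck)
    value = results[quantity]
    return min(options, key=lambda label: abs(options[label] - value))


def fake_simulator(deck: str) -> Dict[str, float]:
    """Stand-in for a real PHREEQC call; the value is a placeholder."""
    return {"si_calcite": -0.42}


deck = build_input(7.0, 25.0, {"Ca": 1.0, "C(4)": 2.0})
choice = answer_item(
    deck, fake_simulator, "si_calcite", {"A": -1.5, "B": -0.4, "C": 0.3}
)
print(deck)
print("committed answer:", choice)
```

Passing the simulator in as a callable keeps each stage separately testable, which matters when the goal is to find where the chain breaks rather than only whether the final answer is right.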

However, the gains are not monotonic: tool-augmented agents also lose items they answered correctly without tools, revealing regressions that average accuracy alone hides. We further show that output-access protocol matters. A table-of-contents interface can reduce token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models that cannot reliably navigate structured simulator outputs.
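One way to picture a table-of-contents interface is as a two-step protocol: the agent first sees only section names and sizes, then requests the sections it needs. The sketch below is an assumption about how such an interface could look, not the paper's protocol; the section names and the `table_of_contents` and `fetch_section` helpers are hypothetical, and the section bodies are filler text.

```python
from typing import Dict, List


def table_of_contents(sections: Dict[str, str]) -> List[str]:
    """Return section names with sizes instead of the full output text."""
    return [f"{name} ({len(text)} chars)" for name, text in sections.items()]


def fetch_section(sections: Dict[str, str], name: str) -> str:
    """Return one section on request; fail loudly on an unknown name."""
    if name not in sections:
        raise KeyError(f"unknown section: {name}")
    return sections[name]


# Filler bodies standing in for parsed simulator output (hypothetical names).
sections = {
    "solution_composition": "x" * 4000,
    "distribution_of_species": "y" * 12000,
    "saturation_indices": "z" * 1500,
}

toc = table_of_contents(sections)
picked = fetch_section(sections, "saturation_indices")

full_size = sum(len(text) for text in sections.values())
toc_size = sum(len(entry) for entry in toc) + len(picked)
print(toc)
print(f"characters shown: {toc_size} of {full_size}")
```

The saving comes only if the agent picks the right section on the first try; a model that requests the wrong section, or every section in turn, pays extra, which is consistent with the reported degradation for mid-tier models.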

PHREEQC-MCQ-200 therefore frames scientific tool use as an end-to-end diagnostic problem rather than a simple tool-calling capability. We argue that evaluations of scientific agents should report not only accuracy, but also item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.
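Item-level retention can be computed from per-item correctness under the two conditions. The sketch below assumes retention means the share of items answered correctly without tools that remain correct with tools; the paper may define it differently, and the function name and toy data are hypothetical.

```python
from typing import Dict


def retention_report(
    no_tool: Dict[str, bool], with_tool: Dict[str, bool]
) -> Dict[str, float]:
    """Compare per-item correctness without and with simulator access."""
    items = sorted(no_tool.keys() & with_tool.keys())
    if not items:
        raise ValueError("no items shared between the two runs")
    base_correct = [i for i in items if no_tool[i]]
    retained = [i for i in base_correct if with_tool[i]]
    gained = [i for i in items if not no_tool[i] and with_tool[i]]
    return {
        "acc_no_tool": len(base_correct) / len(items),
        "acc_with_tool": sum(with_tool[i] for i in items) / len(items),
        # Undefined when nothing was correct without tools.
        "retention": (
            len(retained) / len(base_correct) if base_correct else float("nan")
        ),
        "lost": len(base_correct) - len(retained),
        "gained": len(gained),
    }


# Toy data: accuracy rises from 0.50 to 0.75, yet one item is lost.
no_tool = {"q1": True, "q2": True, "q3": False, "q4": False}
with_tool = {"q1": True, "q2": False, "q3": True, "q4": True}
report = retention_report(no_tool, with_tool)
print(report)
```

In the toy data the aggregate gain hides the regression on `q2`, which is the pattern that average accuracy alone cannot show.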
