PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents
Quick Answer
PHREEQC-MCQ-200 is a benchmark for evaluating tool-augmented agents in aqueous-geochemistry simulations, revealing that simulator access enhances accuracy but can also lead to regressions.
Quick Take
The study argues that scientific agents should be evaluated not only on accuracy but also on item-level retention and output-access sensitivity.
Key Points
- Benchmark consists of 200 multiple-choice questions from 21 validated PHREEQC scenarios.
- Simulator access substantially improves aggregate accuracy across multiple frontier and mid-tier model families.
- Tool-augmented agents can regress on items they answered correctly without tools.
- The output-access protocol matters: a table-of-contents interface can cut token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models.
- Evaluations should report accuracy, item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.
Article Content
From source RSS / original summary
arXiv:2607.00436v1. Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations.
The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construct simulator inputs, execute PHREEQC, inspect structured outputs, and commit to final answers. Across multiple frontier and mid-tier model families, simulator access substantially improves aggregate accuracy, confirming that grounded execution is necessary for many scientific-computation tasks.
However, the gains are not monotonic: tool-augmented agents also lose items they answered correctly without tools, revealing regressions that average accuracy alone hides. We further show that output-access protocol matters. A table-of-contents interface can reduce token cost while preserving or improving accuracy for stronger models, but it degrades performance for mid-tier models that cannot reliably navigate structured simulator outputs.
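To make the output-access distinction concrete, here is a minimal Python sketch contrasting a full dump of simulator output with a table-of-contents interface. It is an illustration under stated assumptions, not the paper's protocol: the function names, the placeholder section bodies, and the use of character counts as a stand-in for token cost are all hypothetical.

```python
# Hypothetical simulator output, split into named sections.
# The bodies are placeholders; only their lengths matter for this sketch.
SECTIONS = {
    "Solution composition": "x" * 800,
    "Description of solution": "x" * 600,
    "Distribution of species": "x" * 4000,
    "Saturation indices": "x" * 1200,
}


def full_dump(sections: dict[str, str]) -> str:
    """Return every section at once (the costly baseline)."""
    return "\n".join(f"{name}\n{body}" for name, body in sections.items())


def table_of_contents(sections: dict[str, str]) -> str:
    """Return only section names and sizes; the agent must then pick one."""
    return "\n".join(f"{name} ({len(body)} chars)" for name, body in sections.items())


def read_section(sections: dict[str, str], name: str) -> str:
    """Return one section; a wrong name fails, which is where navigation can break."""
    if name not in sections:
        raise KeyError(f"unknown section: {name}")
    return sections[name]


# Character counts as a rough proxy for token cost (an assumption).
dump_cost = len(full_dump(SECTIONS))
toc_cost = len(table_of_contents(SECTIONS)) + len(read_section(SECTIONS, "Saturation indices"))
print(dump_cost, toc_cost)
```

In this sketch the table-of-contents path is cheaper only if the agent requests the right section on the first try, which is consistent with the abstract's observation that the interface helps models that navigate structured output reliably and hurts those that do not.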
PHREEQC-MCQ-200 therefore frames scientific tool use as an end-to-end diagnostic problem rather than a simple tool-calling capability. We argue that evaluations of scientific agents should report not only accuracy, but also item-level retention, output-access sensitivity, trajectory failures, and where the computation chain breaks.
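As a rough illustration of why average accuracy can hide regressions, the Python sketch below compares per-item correctness with and without tools. The metric definitions (retention as the share of no-tool-correct items that stay correct with tools), the function and field names, and the five-item toy data are assumptions made for this example, not the paper's definitions or results.

```python
import math
from dataclasses import dataclass


@dataclass
class PairedReport:
    acc_no_tool: float
    acc_with_tool: float
    retention: float        # share of no-tool-correct items still correct with tools
    regressions: list[str]  # correct without tools, wrong with tools
    gains: list[str]        # wrong without tools, correct with tools


def paired_report(no_tool: dict[str, bool], with_tool: dict[str, bool]) -> PairedReport:
    """Compare two runs item by item; both dicts map item id -> answered correctly."""
    if no_tool.keys() != with_tool.keys():
        raise ValueError("both runs must cover the same item ids")
    if not no_tool:
        raise ValueError("no items to score")
    ids = sorted(no_tool)
    base_correct = [i for i in ids if no_tool[i]]
    regressions = [i for i in base_correct if not with_tool[i]]
    gains = [i for i in ids if not no_tool[i] and with_tool[i]]
    # Retention is undefined when nothing was correct without tools.
    retention = (
        (len(base_correct) - len(regressions)) / len(base_correct)
        if base_correct
        else math.nan
    )
    return PairedReport(
        acc_no_tool=sum(no_tool.values()) / len(ids),
        acc_with_tool=sum(with_tool.values()) / len(ids),
        retention=retention,
        regressions=regressions,
        gains=gains,
    )


# Toy data (made up for illustration): accuracy doubles, yet one item is lost.
no_tool = {"q1": True, "q2": True, "q3": False, "q4": False, "q5": False}
with_tool = {"q1": True, "q2": False, "q3": True, "q4": True, "q5": True}
report = paired_report(no_tool, with_tool)
print(report)
```

On the toy data, accuracy rises from 0.4 to 0.8 while retention is only 0.5, so the aggregate number alone would not show that `q2` was lost once tools were added.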