AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv cs.AI·Parsa Mazaheri, Kasra Mazaheri

5/22/2026

Quick Take

AgentAtlas proposes an evaluation framework for LLM agents that addresses fragmented benchmarks with a six-state control-decision taxonomy and a nine-category trajectory-failure taxonomy. In a demonstration run on eight models, removing the explicit label menu from the prompt lowers every model's trajectory accuracy by 14-40 percentage points, to a 0.54-0.62 floor, which supports the argument that a single accuracy metric is inadequate for assessing agent performance.

Key Points

Introduces a six-state control-decision taxonomy for LLM agents.
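Presents a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact).
Compares taxonomy-aware and taxonomy-blind prompts: removing the explicit label menu drops every model's trajectory accuracy by 14-40 percentage points, to a 0.54-0.62 floor.
Audits fifteen agent benchmarks against six behavioral axes.
Finds that no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention.
Demonstrates the methodology on a fixed set of eight models (four frontier closed, four open-weight) with 1,342 generated items, framed as a measurement-protocol demonstration rather than a benchmark release.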
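
As a rough illustration of the two taxonomies listed above, the labels could be represented as plain Python types. This is a minimal sketch, not the authors' code: the six state names and the two label fields (primary_error_source, impact) come from the abstract, while the class names, the `category` field, and the `parse_decision` helper are assumptions made for the example. The nine failure categories are not enumerated in this summary, so they are left as free-form strings.

```python
from dataclasses import dataclass
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states named in the abstract."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class TrajectoryLabel:
    """One trajectory-failure annotation (hypothetical record layout)."""
    category: str              # one of the nine categories; names not given here
    primary_error_source: str  # first hierarchical label from the abstract
    impact: str                # second, orthogonal hierarchical label


def parse_decision(text: str) -> ControlDecision:
    """Map a model's free-text answer to a state, ignoring case and padding.

    Raises ValueError when the answer is not one of the six states.
    """
    return ControlDecision(text.strip().capitalize())
```

A closed enum makes out-of-taxonomy answers fail loudly at parse time, which matters when a taxonomy-blind prompt gives the model no menu to choose from.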

[Submitted on 19 May 2026]

Abstract: Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model's trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.
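
The taxonomy-aware vs. taxonomy-blind comparison reduces to scoring the same items under two prompt modes and reporting the difference in percentage points. The sketch below shows that arithmetic only, under assumed inputs: the function names and the toy labels are hypothetical, and the numbers are not results from the paper.

```python
from typing import Sequence


def count_correct(predictions: Sequence[str], gold: Sequence[str]) -> int:
    """Number of items whose predicted label exactly matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold))


def prompt_mode_gap_pp(
    aware_predictions: Sequence[str],
    blind_predictions: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy lost, in percentage points, when the label menu is removed.

    Positive values mean the taxonomy-aware prompt scored higher.
    """
    if not gold:
        raise ValueError("gold must not be empty")
    aware = count_correct(aware_predictions, gold)
    blind = count_correct(blind_predictions, gold)
    return 100.0 * (aware - blind) / len(gold)


# Toy data: placeholder labels, not the paper's categories or results.
gold = ["c1", "c2", "c3", "c4", "c5"]
aware = ["c1", "c2", "c3", "c4", "c9"]   # 4 of 5 correct
blind = ["c1", "c2", "c3", "c8", "c9"]   # 3 of 5 correct
gap = prompt_mode_gap_pp(aware, blind, gold)  # 20.0 percentage points
```

Reporting the gap in percentage points rather than as a relative change matches the abstract's "14-40 pp" phrasing and keeps models with different baseline accuracies comparable.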

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
ACM classes:	I.2.7; I.2.6; I.2.11
Cite as:	arXiv:2605.20530 [cs.AI]
	(or arXiv:2605.20530v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.20530 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Parsa Mazaheri
[v1] Tue, 19 May 2026 22:05:12 UTC (5,440 KB)

Originally published at arxiv.org
