DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
Quick Take
DecisionBench is a benchmark for evaluating emergent delegation in long-horizon workflows.
Key Points
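- Fixes a task suite (GAIA, tau-bench, BFCL multi-turn) and a peer-model pool (11 models, 7 vendor families).
- Reports a multi-axis metric suite: quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling.
- Places perfect delegation 15-31 percentage points above measured performance on every suite, leaving large headroom for future orchestration methods.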
Abstract: We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
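The abstract names routing fidelity-at-k and a counterfactual-delegation ceiling but does not give their formulas here. The sketch below shows one plausible reading, not the paper's definition: it assumes fidelity-at-k is the share of delegating episodes whose chosen peer is among the k best-scoring peers for that task, and that the ceiling gap is the mean best-peer score minus the mean achieved score. The `Episode` record, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Episode:
    """One task instance. All field names are hypothetical, not from the paper."""
    chosen_peer: Optional[str]      # peer delegated to, or None if no delegation
    peer_scores: Dict[str, float]   # assumed known score of each peer on this task
    achieved_score: float           # score the orchestrator actually obtained


def fidelity_at_k(episodes: List[Episode], k: int = 1) -> float:
    """Assumed definition: fraction of delegating episodes whose chosen
    peer is among the k highest-scoring peers for that task."""
    delegated = [e for e in episodes if e.chosen_peer is not None]
    if not delegated:
        return 0.0
    hits = 0
    for e in delegated:
        # Rank peers from best to worst; ties keep dictionary order.
        ranked = sorted(e.peer_scores, key=e.peer_scores.get, reverse=True)
        if e.chosen_peer in ranked[:k]:
            hits += 1
    return hits / len(delegated)


def ceiling_gap(episodes: List[Episode]) -> float:
    """Assumed definition: mean best-peer score minus mean achieved score."""
    n = len(episodes)
    oracle = sum(max(e.peer_scores.values()) for e in episodes) / n
    measured = sum(e.achieved_score for e in episodes) / n
    return oracle - measured
```

Under this reading, two conditions can share the same mean `achieved_score` while differing sharply in `fidelity_at_k`, which is the kind of gap a quality-only evaluation would not surface.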
| Comments: | 28 pages, 9 figures, 11 tables. Code and data: this https URL |
| Subjects: | Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA) |
| ACM classes: | I.2.7; I.2.6 |
| Cite as: | arXiv:2605.19099 [cs.AI] (or arXiv:2605.19099v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19099 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Megan Wang
[v1] Mon, 18 May 2026 20:37:14 UTC (2,230 KB)
Originally published at arxiv.org