DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

arXiv cs.AI·Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

5/20/2026


Quick Answer

DecisionBench is a benchmark for evaluating emergent delegation in long-horizon agentic workflows. Its reference sweep covers 23,375 task instances over a pool of 11 peer models.

Quick Take

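DecisionBench fixes a task suite, a pool of 11 peer models, and a delegation interface so that orchestration methods can be compared on how they delegate, not only on final answers. In a five-condition reference sweep (23,375 task instances), mean end-task quality is statistically indistinguishable across the four awareness conditions, while routing fidelity-at-1 ranges from 7.5% to 29.5%. A counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance, which points to large unrealized headroom for orchestration methods. The release includes the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.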

Key Points

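The benchmark fixes a task suite (GAIA, tau-bench, BFCL multi-turn) and a peer pool of 11 models from 7 vendor families.
Mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21).
Routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions; the delivery channel (on-demand tool vs. preloaded description) matters more than the description content.
A counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite.
The release includes the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.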
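
To make the routing fidelity-at-k figure concrete, here is a minimal Python sketch of one plausible reading of the metric: the fraction of delegations in which the chosen peer is among the k best-scoring peers for that task. The summary does not give the paper's exact definition, so the ranking rule, the tie-breaking, the function name routing_fidelity_at_k, and the toy peers and scores are all assumptions made for illustration.

```python
from typing import Dict, List


def routing_fidelity_at_k(
    choices: List[str],
    peer_scores: List[Dict[str, float]],
    k: int = 1,
) -> float:
    """Fraction of delegations whose chosen peer is among the k best-scoring peers.

    ASSUMPTION: "best" means highest per-task score in `peer_scores`; the
    benchmark's actual ranking rule and tie handling may differ.
    """
    if len(choices) != len(peer_scores):
        raise ValueError("choices and peer_scores must have the same length")
    if not choices:
        return 0.0
    hits = 0
    for chosen, scores in zip(choices, peer_scores):
        # Rank peers by score, highest first; break ties by name for determinism.
        ranked = sorted(scores, key=lambda name: (-scores[name], name))
        if chosen in ranked[:k]:
            hits += 1
    return hits / len(choices)


# Hypothetical toy data: three delegations over three made-up peers.
toy_choices = ["peer_a", "peer_b", "peer_c"]
toy_scores = [
    {"peer_a": 0.9, "peer_b": 0.4, "peer_c": 0.1},  # chose the best peer
    {"peer_a": 0.8, "peer_b": 0.5, "peer_c": 0.2},  # chose the second-best peer
    {"peer_a": 0.7, "peer_b": 0.6, "peer_c": 0.1},  # chose the worst peer
]
fidelity_at_1 = routing_fidelity_at_k(toy_choices, toy_scores, k=1)
fidelity_at_2 = routing_fidelity_at_k(toy_choices, toy_scores, k=2)
print(f"fidelity@1={fidelity_at_1:.3f}, fidelity@2={fidelity_at_2:.3f}")
```

Under this reading, a fidelity-at-1 of 7.5% to 29.5% would mean the orchestrator picks the single best peer in well under a third of its delegations, even though end-task quality barely changes.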



[Submitted on 18 May 2026]


Abstract: We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
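
The counterfactual-delegation ceiling in finding (iii) can be illustrated with a small Python sketch. It assumes the ceiling is an oracle that routes every task to its best-scoring peer and that headroom is the gap between that oracle's mean score and the measured mean, expressed in percentage points. The abstract does not spell out the computation, so this definition, the function names, and the toy scores are hypothetical.

```python
from typing import Dict, List


def counterfactual_ceiling(peer_scores: List[Dict[str, float]]) -> float:
    """Mean score if every task were routed to its best-scoring peer.

    ASSUMPTION: an oracle picks the per-task maximum over the peer pool.
    """
    if not peer_scores:
        raise ValueError("peer_scores must not be empty")
    return sum(max(scores.values()) for scores in peer_scores) / len(peer_scores)


def headroom_points(measured: List[float], peer_scores: List[Dict[str, float]]) -> float:
    """Gap between the oracle ceiling and measured mean quality, in percentage points."""
    if len(measured) != len(peer_scores):
        raise ValueError("measured and peer_scores must have the same length")
    measured_mean = sum(measured) / len(measured)
    return 100.0 * (counterfactual_ceiling(peer_scores) - measured_mean)


# Hypothetical toy data: four tasks, two made-up peers, binary success scores.
toy_peer_scores = [
    {"peer_a": 1.0, "peer_b": 0.0},
    {"peer_a": 0.0, "peer_b": 1.0},
    {"peer_a": 1.0, "peer_b": 1.0},
    {"peer_a": 0.0, "peer_b": 0.0},
]
toy_measured = [1.0, 0.0, 1.0, 0.0]  # what the orchestrator actually achieved

ceiling = counterfactual_ceiling(toy_peer_scores)
headroom = headroom_points(toy_measured, toy_peer_scores)
print(f"ceiling={ceiling:.2f}, headroom={headroom:.1f} points")
```

In this toy case the orchestrator loses one task that a different peer would have solved, so the ceiling sits 25 points above measured quality. The reported 15-31 point range would correspond to the same kind of gap on each suite, assuming the paper's ceiling is computed along these lines.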

Comments:	28 pages, 9 figures, 11 tables. Code and data: this https URL
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
ACM classes:	I.2.7; I.2.6
Cite as:	arXiv:2605.19099 [cs.AI]
	(or arXiv:2605.19099v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19099 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Megan Wang
[v1] Mon, 18 May 2026 20:37:14 UTC (2,230 KB)

— Originally published at arxiv.org
