Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv cs.AI·Maksim Ivanov, Abhijay Rana

·~2 min·5/27/2026·en·0

Quick Take

Anchor is a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. The authors use it to build ERP-Bench, a benchmark of 300 long-horizon tasks in a production-grade ERP system. Producing every task artifact from one specification mitigates artifact drift and yields tasks with controlled difficulty and known optimal solutions. Frontier models reach a fully optimal solution in only 17.4% of trials.

Key Points

Anchor produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier from a single parametric specification.
ERP-Bench includes 300 long-horizon tasks across procurement and manufacturing workflows.
Generation parameters predict the realized difficulty of the tasks.
Frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4%.
The task generator and ERP-Bench dataset are available at erpbench.ai.

Article Content

From source RSS / original summary

arXiv:2605.26321v1 · Announce Type: new

Abstract: AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale.

Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs.

From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness.
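To make the "single specification, four artifacts" idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Spec` fields, the toy procurement problem, the brute-force `solve` (standing in for a real constraint solver), and the `generate_task` helper are all hypothetical names chosen for illustration. The point it tries to show is that when the instruction, environment, oracle, and verifier are all derived from the same object, they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Spec:
    """Hypothetical parametric spec: buy `demand` units from capacity-limited suppliers."""
    demand: int
    unit_price: dict  # supplier name -> price per unit
    capacity: dict    # supplier name -> maximum units available


def solve(spec):
    """Brute-force stand-in for a real solver: cheapest plan that meets demand."""
    names = sorted(spec.unit_price)
    best_plan, best_cost = None, None
    for qty in product(*(range(spec.capacity[n] + 1) for n in names)):
        if sum(qty) != spec.demand:
            continue
        cost = sum(q * spec.unit_price[n] for n, q in zip(names, qty))
        if best_cost is None or cost < best_cost:
            best_plan, best_cost = dict(zip(names, qty)), cost
    return best_plan, best_cost


def generate_task(spec):
    """Derive instruction, environment, oracle, and verifier from one spec."""
    plan, cost = solve(spec)
    if plan is None:
        raise ValueError("spec has no feasible solution")

    instruction = (
        f"Purchase exactly {spec.demand} units at the lowest total cost, "
        "without exceeding any supplier's capacity."
    )
    environment = {
        n: {"unit_price": spec.unit_price[n], "capacity": spec.capacity[n]}
        for n in sorted(spec.unit_price)
    }

    def verify(end_state):
        """State-based check: looks only at the final orders, not at how they were placed."""
        orders = end_state.get("orders", {})
        feasible = sum(orders.values()) == spec.demand and all(
            0 <= q <= spec.capacity.get(n, 0) for n, q in orders.items()
        )
        spend = sum(q * spec.unit_price.get(n, 0) for n, q in orders.items())
        return {"feasible": feasible, "optimal": feasible and spend == cost}

    return {
        "instruction": instruction,
        "environment": environment,
        "oracle": {"plan": plan, "cost": cost},
        "verify": verify,
    }


task = generate_task(Spec(demand=5, unit_price={"A": 3, "B": 2}, capacity={"A": 5, "B": 3}))
print(task["oracle"])
print(task["verify"]({"orders": {"A": 5}}))
```

Changing `demand`, prices, or capacities in the spec yields a new task whose oracle and verifier stay consistent with the instruction, which is the property the abstract attributes to parameter changes in Anchor. A real pipeline would replace the brute-force search with a proper optimization solver.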

We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work.
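The gap between the two reported rates is easier to read with the two metrics written out. The sketch below assumes each trial is graded with two booleans, `feasible` (all explicit constraints met) and `optimal` (matches the solver-certified optimum), and that an optimal trial also counts as feasible. The function name, field names, and the four sample trials are made up for illustration and are not data from the paper.

```python
def summarize(trials):
    """Return the constraint-satisfaction rate and the optimality rate over trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    feasible = sum(1 for t in trials if t["feasible"])
    # A trial only counts as optimal if it is also feasible.
    optimal = sum(1 for t in trials if t["feasible"] and t["optimal"])
    return {"constraint_rate": feasible / n, "optimal_rate": optimal / n}


# Illustrative trial outcomes only, not results from ERP-Bench.
sample = [
    {"feasible": True, "optimal": True},
    {"feasible": True, "optimal": False},
    {"feasible": False, "optimal": False},
    {"feasible": False, "optimal": False},
]
print(summarize(sample))
```

Under this assumption the optimality rate can never exceed the constraint-satisfaction rate, which matches the ordering of the two figures reported in the abstract.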

We release the task generator and ERP-Bench dataset at erpbench.ai.
