STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
Quick Answer
STAGE-Claw introduces an automated framework for evaluating personal agents in realistic scenarios, creating 40 benchmark tasks and assessing 11 models based on final system state correctness.
Quick Take
By generating benchmark tasks automatically and scoring agents on the final system state, STAGE-Claw aims to make personal-agent evaluation more scalable and reliable than traditional benchmarks, which rely on sandboxed artifacts, static task design, and coarse scoring.
Key Points
- STAGE-Claw automates the creation and validation of benchmark tasks for personal agents.
- Evaluates agents based on the correctness of the final system state, not just textual responses.
- Provides a benchmark of 40 challenging real-scenario tasks for comprehensive evaluation.
- Assesses 11 frontier models, analyzing task scores, costs, tool-call reliability, and common failure patterns.
- Offers a scalable, state-based evaluation method for realistic user scenarios.
Article Content
From source RSS / original summary: arXiv:2606.10394v1, Announce Type: new. Abstract: Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal agents.
This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs.
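To make the shape of such a generated task concrete, here is a minimal Python sketch of a task bundle and a validation step. Everything in it is an assumption for illustration: the names `BenchmarkTask`, `validate_task`, and `State`, the use of a path-to-contents dictionary as the "environment", and the validation rule itself are hypothetical and are not taken from the STAGE-Claw paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical state model: file path -> file contents.
State = Dict[str, str]
Verifier = Callable[[State], bool]


@dataclass
class BenchmarkTask:
    """Illustrative bundle of the pieces the abstract lists for one task."""
    hint: str                 # short task hint the generator starts from
    prompt: str               # instruction shown to the agent
    initial_state: State      # environment before the agent acts
    ground_truth: State       # expected environment after a correct run
    verifiers: List[Verifier] = field(default_factory=list)


def validate_task(task: BenchmarkTask) -> bool:
    """Assumed sanity check: the ground truth must pass every verifier,
    and the untouched initial state must not already pass them all."""
    if not task.verifiers:
        return False
    truth_ok = all(v(task.ground_truth) for v in task.verifiers)
    already_solved = all(v(task.initial_state) for v in task.verifiers)
    return truth_ok and not already_solved


example_task = BenchmarkTask(
    hint="archive old invoices",
    prompt="Move invoice_2019.txt into the archive folder.",
    initial_state={"docs/invoice_2019.txt": "total: 120"},
    ground_truth={"archive/invoice_2019.txt": "total: 120"},
    verifiers=[
        lambda s: "archive/invoice_2019.txt" in s,
        lambda s: "docs/invoice_2019.txt" not in s,
    ],
)

is_valid = validate_task(example_task)
```

The check rejects tasks that are trivially satisfied before the agent does anything, which is one plausible reason a framework would validate generated tasks before using them.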
Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.
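The difference between state-based scoring and judging a textual response can be sketched in a few lines of Python. This is an illustrative sketch only: the function `score_final_state`, the check names, the dictionary-based state, and the partial-credit scoring rule are all assumptions, not the scoring procedure used in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical state model: file path -> file contents.
State = Dict[str, str]
Check = Tuple[str, Callable[[State], bool]]


def score_final_state(final_state: State, checks: List[Check]) -> Tuple[float, List[str]]:
    """Return the fraction of checks that pass and the names of failed checks.

    Only the final state is inspected; the agent's reply text is ignored.
    """
    if not checks:
        raise ValueError("at least one check is required")
    failed = [name for name, check in checks if not check(final_state)]
    score = (len(checks) - len(failed)) / len(checks)
    return score, failed


CHECKS: List[Check] = [
    ("file_archived", lambda s: "archive/invoice_2019.txt" in s),
    ("original_removed", lambda s: "docs/invoice_2019.txt" not in s),
    ("contents_preserved", lambda s: s.get("archive/invoice_2019.txt") == "total: 120"),
]

# An agent that replies "Done, the file has been moved" but actually copied it:
agent_reply = "Done, the file has been moved."
final_state = {
    "docs/invoice_2019.txt": "total: 120",
    "archive/invoice_2019.txt": "total: 120",
}

score, failed = score_final_state(final_state, CHECKS)
```

Here the reply text claims success, yet the state check named `original_removed` fails because the source file is still present, so the run does not receive full credit. A response-only grader would have no way to notice this.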