AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
Quick Take
AndroidDaily is a benchmark for evaluating mobile GUI agents on 350 tasks across 94 closed-source Android applications. Its GRADE evaluator reaches 87.37% agreement with human evaluators. The best model achieves a 62.0% success rate, highlighting a gap between current reasoning capabilities and practical execution.
Key Points
- AndroidDaily benchmarks 350 tasks across 94 popular Android applications.
- GRADE evaluates agents against three tiers of observable guidelines: operational obligations, output quality, and negative constraints.
- The best performing model achieves a 62.0% success rate on AndroidDaily.
- GRADE's judgments reach 87.37% agreement with human evaluators.
- Closed-source applications remain largely untested in existing benchmarks.
Article Content
From source RSS / original summary
arXiv:2605.27761v1, Announce Type: new
Abstract: The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable.
To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities.
To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints.
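The three guideline tiers can be pictured as a small data model. The Python sketch below is illustrative only and is not the authors' implementation: the names `Tier`, `Guideline`, `Judgment`, and `task_succeeds`, the example guidelines, and the rule that every guideline must be satisfied are all assumptions, since the abstract does not say how GRADE combines its judgments.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    """The three guideline tiers named in the abstract."""
    OPERATIONAL_OBLIGATION = "operational obligation"
    OUTPUT_QUALITY = "output quality"
    NEGATIVE_CONSTRAINT = "negative constraint"


@dataclass(frozen=True)
class Guideline:
    tier: Tier
    description: str


@dataclass(frozen=True)
class Judgment:
    guideline: Guideline
    satisfied: bool
    step: Optional[int]  # trajectory step with the deciding evidence, if any


def task_succeeds(judgments: List[Judgment]) -> bool:
    # Assumed aggregation rule: the task passes only if every guideline holds.
    return all(j.satisfied for j in judgments)


def failed_guidelines(judgments: List[Judgment]) -> List[Judgment]:
    # Diagnostic view: which guidelines were violated, and at which step.
    return [j for j in judgments if not j.satisfied]


# Hypothetical task: "order a coffee for pickup without using a coupon".
judgments = [
    Judgment(Guideline(Tier.OPERATIONAL_OBLIGATION, "select pickup"), True, 3),
    Judgment(Guideline(Tier.OUTPUT_QUALITY, "order shows one coffee"), True, 7),
    Judgment(Guideline(Tier.NEGATIVE_CONSTRAINT, "no coupon applied"), False, 6),
]

print(task_succeeds(judgments))  # False
for j in failed_guidelines(judgments):
    print(f"step {j.step}: violated {j.guideline.tier.value}: {j.guideline.description}")
```

Because every check refers to something visible on screen, a structure like this needs no access to the application's internal state.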
GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.
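For readers unfamiliar with agreement figures like the one above, one common reading is the share of tasks on which the automatic evaluator and a human give the same pass/fail verdict. The Python sketch below assumes that definition; the abstract does not specify the exact metric, the function name `agreement_rate` is hypothetical, and the labels are toy values rather than data from the paper.

```python
from typing import Sequence


def agreement_rate(evaluator: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of tasks where the evaluator and the human verdicts match."""
    if len(evaluator) != len(human):
        raise ValueError("label sequences must have the same length")
    if not evaluator:
        raise ValueError("at least one labeled task is required")
    matches = sum(e == h for e, h in zip(evaluator, human))
    return matches / len(evaluator)


# Toy verdicts for four tasks (True = task judged successful).
evaluator_verdicts = [True, False, True, True]
human_verdicts = [True, True, True, True]

print(agreement_rate(evaluator_verdicts, human_verdicts))  # 0.75
```

Simple agreement does not correct for matches that occur by chance, so it should not be confused with chance-adjusted statistics such as Cohen's kappa.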
