A Task-State Representation for Long-Horizon Mobile GUI Agents
Quick Answer
The Task-State Representation (TSR) framework improves long-horizon mobile GUI agents by decoupling persistent task state from transient screen observations. It yields up to a 12-point absolute increase in success rate on complex tasks, without training or architectural changes.
Key Points
- TSR maintains a global instruction summary and dynamic progress tracker.
- It verifies actions with a transition-aware mechanism for improved reasoning.
- Experiments show TSR's effectiveness across four mobile GUI benchmarks.
- The framework is training-free and requires no architectural modifications.
- TSR addresses issues like hallucinated progress and stale interface interactions.
Article Excerpt
Original abstract (arXiv:2607.00502v1): While long-horizon mobile GUI agents typically rely on thought-action-observation loops, they struggle to separate persistent task states from transient screen observations. As execution histories grow, this entanglement imposes a severe context burden, causing agents to forget initial requirements, hallucinate progress, or repeatedly interact with stale interfaces.
To address this, we introduce Task-State Representation (TSR), a training-free framework that explicitly decouples task state from sensory input. Acting as a lightweight external wrapper, TSR maintains three structured components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier. By continuously updating through pre- and post-action visual comparisons, TSR effectively guides the agent's reasoning without requiring architectural modifications.
Experiments across four mobile GUI benchmarks validate TSR's effectiveness, yielding up to a 12 absolute point increase in success rate on complex cross-application and memory-intensive tasks.
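To make the three components concrete, here is a minimal Python sketch of an external task-state wrapper. It is an illustration under stated assumptions, not the paper's implementation: the class `TaskState`, its methods, and the `evidence` mapping are hypothetical names, and the pre- and post-action visual comparison is replaced by a simple comparison of sets of UI element labels.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Persistent task state, kept apart from transient screen observations.

    Hypothetical structure for illustration; not taken from the paper.
    """
    instruction_summary: str                                  # global instruction summary
    subgoals: dict[str, bool] = field(default_factory=dict)   # progress tracker
    stale_steps: int = 0                                      # consecutive no-op actions

    def verify_transition(self, before: frozenset[str], after: frozenset[str]) -> bool:
        """Toy transition check: did the action change the screen at all?

        Assumption: screens are summarized as sets of element labels,
        standing in for a real visual comparison.
        """
        changed = before != after
        self.stale_steps = 0 if changed else self.stale_steps + 1
        return changed

    def update_progress(self, after: frozenset[str], evidence: dict[str, str]) -> None:
        """Mark a subgoal done once its (hypothetical) evidence element is visible."""
        for subgoal, element in evidence.items():
            if element in after:
                self.subgoals[subgoal] = True

    def to_prompt(self) -> str:
        """Render a compact state block to prepend to the agent's next prompt."""
        lines = [f"Task: {self.instruction_summary}"]
        for name, done in self.subgoals.items():
            lines.append(f"[{'x' if done else ' '}] {name}")
        if self.stale_steps:
            lines.append(
                f"Warning: last {self.stale_steps} action(s) did not change the screen."
            )
        return "\n".join(lines)


# Usage: one step of a made-up cross-application task.
state = TaskState(
    "Copy the order number from Mail into Notes",
    subgoals={"open mail": False, "paste into notes": False},
)
evidence = {"open mail": "inbox_list", "paste into notes": "note_saved_banner"}

before = frozenset({"home_screen"})
after = frozenset({"inbox_list", "search_bar"})
if state.verify_transition(before, after):
    state.update_progress(after, evidence)
print(state.to_prompt())
```

Because the state lives outside the agent, the prompt can carry this short summary instead of the full execution history, which is the kind of context relief the abstract describes. A completed subgoal stays marked even after its evidence leaves the screen, and an action that leaves the screen unchanged raises a staleness warning rather than counting as progress.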