A Task-State Representation for Long-Horizon Mobile GUI Agents

arXiv cs.CL·Yujie Zheng, Zikang Liu, Xin Zhao, Ji-Rong Wen

~1 min · 7/2/2026 · en

Quick Answer

This paper shows that the Task-State Representation (TSR) framework improves long-horizon mobile GUI agents by decoupling task state from sensory input, raising success rates by up to 12 absolute points on complex tasks without architectural changes.

Key Points

TSR maintains a global instruction summary and a dynamic progress tracker for subgoals.
A transition-aware verifier checks each action by comparing the screen before and after it.
Experiments show TSR's effectiveness across four mobile GUI benchmarks.
The framework is training-free and requires no architectural modifications.
TSR targets failure modes such as forgotten initial requirements, hallucinated progress, and repeated interaction with stale interfaces.

Article Excerpt

From source RSS / original summary

arXiv:2607.00502v1

Abstract: While long-horizon mobile GUI agents typically rely on thought-action-observation loops, they struggle to separate persistent task states from transient screen observations. As execution histories grow, this entanglement imposes a severe context burden, causing agents to forget initial requirements, hallucinate progress, or repeatedly interact with stale interfaces.

To address this, we introduce Task-State Representation (TSR), a training-free framework that explicitly decouples task state from sensory input. Acting as a lightweight external wrapper, TSR maintains three structured components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier. By continuously updating through pre- and post-action visual comparisons, TSR effectively guides the agent's reasoning without requiring architectural modifications.
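The wrapper idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (TaskState, Subgoal, verify_transition, tsr_step) is hypothetical, screens are modeled as plain strings rather than screenshots, and the verifier is a simple "did the screen change" check standing in for the paper's pre- and post-action visual comparison.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Subgoal:
    description: str
    done: bool = False


@dataclass
class TaskState:
    """Persistent task state, kept apart from raw screen observations."""
    instruction_summary: str                                # global instruction summary
    subgoals: List[Subgoal] = field(default_factory=list)   # progress tracker
    notes: List[str] = field(default_factory=list)          # verifier feedback

    def render(self) -> str:
        """Compact text given to the agent instead of the full history."""
        lines = [f"Task: {self.instruction_summary}"]
        for sg in self.subgoals:
            lines.append(f"[{'x' if sg.done else ' '}] {sg.description}")
        lines.extend(f"Note: {n}" for n in self.notes[-2:])
        return "\n".join(lines)


def verify_transition(before: str, after: str) -> bool:
    """Assumed stand-in verifier: an action is effective only if the screen changed."""
    return before != after


# Hypothetical agent signature: (state text, current screen) -> (action, claimed subgoal index)
Agent = Callable[[str, str], Tuple[str, Optional[int]]]


def tsr_step(state: TaskState, agent: Agent,
             execute: Callable[[str], str], screen: str) -> str:
    """Run one wrapped step: prompt with task state, act, verify, then update state."""
    action, claimed = agent(state.render(), screen)
    new_screen = execute(action)
    if verify_transition(screen, new_screen):
        # Record progress only when the transition is confirmed.
        if claimed is not None and 0 <= claimed < len(state.subgoals):
            state.subgoals[claimed].done = True
    else:
        # Unconfirmed progress is rejected and reported back to the agent.
        state.notes.append(f"Action '{action}' left the screen unchanged.")
    return new_screen
```

The design point of the sketch is that the agent's own claim of progress never updates the tracker directly; only a verified screen transition does, which is one plausible way to guard against hallucinated progress and repeated taps on a stale interface.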

Experiments across four mobile GUI benchmarks validate TSR's effectiveness, yielding up to a 12 absolute point increase in success rate on complex cross-application and memory-intensive tasks.
