Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
Quick Answer
Teach VLM introduces a novel approach to extract operational knowledge from mobile screen demonstrations, significantly outperforming existing vision-language models in operation semantics prediction.
Quick Take
The Teach-and-Repeat paradigm feeds the extracted operational knowledge to GUI agents as a procedural reference, yielding consistent Task Success Rate improvements in Android World experiments.
Key Points
- Teach VLM translates mobile screen trajectories into operational knowledge using keyframes from demonstration videos.
- A systematic data flywheel was developed to address the lack of aligned training data.
- The new Chinese Mobile Screen Teach Benchmark allows for fine-grained evaluation of the model's performance.
- Extensive evaluations show Teach VLM achieves state-of-the-art performance in operation semantics prediction.
- The Teach-and-Repeat paradigm improves Task Success Rates for downstream screen-based execution agents.
Article Content
From source RSS / original summary
arXiv:2606.12817v1
Abstract: Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders.
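To make the definition of operational knowledge concrete, here is a minimal Python sketch of one step holding the four parts the abstract names (action type, target UI element, textual argument, execution order). The class name `OperationStep`, its fields, the `render` helper, and the sentence template are all hypothetical illustrations; the abstract does not specify the model's actual output format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OperationStep:
    """One step of operational knowledge (hypothetical schema, not the paper's)."""
    order: int                   # execution order, 1-based
    action: str                  # action type, e.g. "tap", "type", "swipe"
    target: str                  # target UI element, described in words
    text: Optional[str] = None   # textual argument, if the action takes one

    def to_sentence(self) -> str:
        # Render the step as a short natural-language sentence.
        if self.text is not None:
            return f'Step {self.order}: {self.action} "{self.text}" into {self.target}.'
        return f"Step {self.order}: {self.action} {self.target}."


def render(steps: List[OperationStep]) -> List[str]:
    # Sort by execution order so the sentences read as a procedure.
    return [s.to_sentence() for s in sorted(steps, key=lambda s: s.order)]
```

Calling `render` on a tap step and a type step supplied out of order would return the two sentences sorted by their `order` field.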
However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos.
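The abstract does not say how operation-related keyframes are chosen. As a rough illustration of the general idea only, the sketch below uses a naive frame-difference heuristic: a frame is kept when it differs enough from the last kept frame. The function names, the grayscale nested-list frame format, and the default threshold are assumptions made for this example and are not Teach VLM's method.

```python
from typing import List

Frame = List[List[int]]  # assumed format: grayscale pixel rows, values 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    # Average absolute pixel difference between two same-sized frames.
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def select_keyframes(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    # Always keep the first frame, then keep any frame that differs from the
    # most recently kept frame by more than the threshold.
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame rather than the immediately preceding one avoids missing slow transitions that never exceed the threshold between adjacent frames.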
To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents.
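One plausible way to picture the Teach-and-Repeat handoff is as prompt construction: the step-wise sentences produced from a demonstration are placed in front of the execution agent as a reference procedure. The sketch below assumes this text-prompt interface; `build_agent_prompt` and its wording are hypothetical and the abstract does not describe how the agents actually consume the knowledge.

```python
from typing import List


def build_agent_prompt(task: str, knowledge: List[str]) -> str:
    # Combine the task with numbered reference steps from a demonstration.
    lines = [f"Task: {task}", "Reference procedure from a demonstration:"]
    for i, sentence in enumerate(knowledge, start=1):
        lines.append(f"{i}. {sentence}")
    # The agent is told to treat the steps as guidance, not a fixed script.
    lines.append(
        "Follow the reference where the current screen matches; adapt if it does not."
    )
    return "\n".join(lines)
```

Keeping the reference as plain sentences preserves the interpretability the abstract highlights, since a person can read and correct the procedure before an agent repeats it.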
Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.