What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
Quick Take
The study shows that the prefill stage, not decoding, determines GUI grounding accuracy in VLMs, and proposes Re-Prefill, a training-free method that refines candidate selection with a second prefill pass.
Key Points
- Grounding in VLMs follows a two-stage paradigm: prefill selects candidate UI elements, decoding refines coordinates.
- Errors in candidate selection during prefill cannot be effectively corrected during decoding.
- Re-Prefill introduces an attention-guided second prefill stage to refine target selection before coordinate generation.
Abstract:

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at this https URL.
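The candidate-extraction step the abstract describes, selecting visual tokens that consistently receive high attention from the final query position across layers, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `select_target_hypothesis`, the averaging over layers, and the `top_k` cutoff are all assumptions for the sake of the example.

```python
def select_target_hypothesis(attn_maps, visual_start, visual_end, top_k=4):
    """Pick visual tokens that consistently draw high attention from the
    final (query) token across layers.

    attn_maps: one list per layer; each list holds the attention weights
               from the final token to every sequence position.
    visual_start, visual_end: positions of the visual tokens in the sequence.
    Returns the indices of the top_k visual tokens, ranked by attention
    averaged across layers -- a rough "preliminary target hypothesis".
    """
    n_layers = len(attn_maps)
    seq_len = len(attn_maps[0])
    # Average the final-token attention over layers, position by position.
    avg = [sum(layer[i] for layer in attn_maps) / n_layers
           for i in range(seq_len)]
    # Rank only the visual-token positions and keep the top_k.
    visual = sorted(
        ((avg[i], i) for i in range(visual_start, visual_end)),
        reverse=True,
    )
    return [i for _, i in visual[:top_k]]


# Toy example: 3 layers, 12-token sequence, visual tokens at positions 2..9.
# Positions 5 and 7 receive high attention from the final token in every layer.
attn = [[0.0] * 12 for _ in range(3)]
for layer in attn:
    layer[5] = 0.9
    layer[7] = 0.5

hypothesis = select_target_hypothesis(attn, 2, 10, top_k=2)
# → [5, 7]
```

In the paper, the tokens selected this way are appended to the input together with the instruction hidden states before the second prefill pass; this sketch covers only the selection step.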
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.12549 [cs.CV] (or arXiv:2605.12549v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12549 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Jiaping Lin
[v1] Sun, 10 May 2026 07:04:07 UTC (2,089 KB)
— Originally published at arxiv.org