Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
Quick Answer
This paper shows that the PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent, improves spatial reasoning in vision-language models: PERIA-8B gains 10.0% over its Qwen3-8B backbone on in-distribution benchmarks and outperforms previous state-of-the-art baselines of similar size by 7.0%-14.8%.
Quick Take
PERIA's tool-augmented approach lets the model acquire spatial evidence actively through multi-step visual interaction, rather than relying only on implicit visual representations from the vision encoder.
Key Points
- PERIA uses vision perception tools and vision interaction tools to gather evidence for spatial reasoning tasks.
- Training combines supervised trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO).
- PERIA-8B improves over the Qwen3-8B backbone by 4.4% on out-of-distribution benchmarks.
- It achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5.
- Results are reported on 13 benchmarks from 8 datasets, covering map reasoning, visual probing, and vision reconstruction.
Article Content
From source RSS / original summary
arXiv:2606.12830v1
Abstract: While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence.
We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations.
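The split into perception tools and interaction tools can be pictured with a small sketch. The abstract does not give PERIA's actual tool API, so every name here (`read_labels`, `crop_region`, `check_left_of`, `run_agent`, the `scene` layout) is a hypothetical stand-in, and the fixed plan replaces what would in practice be tool calls chosen by the model from earlier observations.

```python
from typing import Callable, Dict, List, Tuple

# All names are hypothetical; the abstract does not specify PERIA's tool API.
Tool = Callable[[dict, str], str]


def read_labels(state: dict, arg: str) -> str:
    """Perception tool (assumed): expose textual evidence found in the scene."""
    return ", ".join(sorted(state["labels"]))


def crop_region(state: dict, arg: str) -> str:
    """Interaction tool (assumed): narrow the visual context to a named region."""
    state["focus"] = arg
    return f"focused on {arg}"


def check_left_of(state: dict, arg: str) -> str:
    """Interaction tool (assumed): verify that 'a' lies left of 'b' for arg 'a,b'."""
    a, b = arg.split(",")
    return str(state["labels"][a][0] < state["labels"][b][0])


TOOLS: Dict[str, Tool] = {
    "read_labels": read_labels,
    "crop_region": crop_region,
    "check_left_of": check_left_of,
}


def run_agent(state: dict, plan: List[Tuple[str, str]]) -> List[str]:
    """Run a fixed plan of (tool, argument) calls and collect the observations.

    A real agent would let the VLM pick each call from prior observations.
    """
    evidence: List[str] = []
    for name, arg in plan:
        if name not in TOOLS:
            evidence.append(f"unknown tool: {name}")
            continue
        evidence.append(TOOLS[name](state, arg))
    return evidence


# Toy scene: label name -> (x, y) position on a map image.
scene = {"labels": {"cafe": (10, 40), "bank": (55, 42)}, "focus": None}
evidence = run_agent(
    scene,
    [("read_labels", ""), ("crop_region", "north"), ("check_left_of", "cafe,bank")],
)
print(evidence)
```

The point of the sketch is only the shape of the loop: each tool call returns a textual observation that is appended to the evidence the reasoning step can draw on.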
To train PERIA, we develop a unified recipe that combines supervised trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%.
It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.
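The abstract names composite rewards and a group-based policy optimization method but gives no formulas. The sketch below is therefore not OR-GIGPO. It only illustrates a generic pattern that group-based methods commonly build on: a weighted sum of reward components per rollout, followed by normalization within a group of rollouts for the same prompt. The component names and weights are assumptions made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical reward components and weights; not taken from the paper.
WEIGHTS: Dict[str, float] = {"answer": 1.0, "format": 0.2, "tool_use": 0.3}


def composite_reward(components: Dict[str, float]) -> float:
    """Weighted sum of per-rollout reward components (missing ones count as 0)."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts sampled for the same prompt (toy values).
rollouts = [
    {"answer": 1.0, "format": 1.0, "tool_use": 1.0},
    {"answer": 0.0, "format": 1.0, "tool_use": 0.0},
    {"answer": 0.0, "format": 0.0, "tool_use": 1.0},
]
rewards = [composite_reward(r) for r in rollouts]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Rollouts that score above the group mean get a positive advantage and the rest a negative one, so the update signal depends on relative quality within the group rather than on the absolute reward scale.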