IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

arXiv cs.CV·Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

5d ago

~2 min · 5/25/2026 · en

Quick Take

IntentionNav is a benchmark for intent-driven object navigation: 500 free-text intents across 176 Isaac Sim scenes, with the target object name withheld from the agent. Three VLMs driving a fixed active-navigation agent identified the intended target in 48.3% of episodes but terminated successfully in only 24.9%, showing how hard indirect human instructions remain for embodied agents.

Key Points

IntentionNav includes 500 intents and 64 target categories in 176 Isaac Sim scenes.
Models achieved a 48.3% identification rate for intended targets.
Agents entered the target's 2 m neighborhood in 68.7% of episodes, but only 24.9% terminated successfully and grounded 1 m success was 5.5%.
Event-script intents had the highest success rate at 28.7%.
Indirect human intent remains a significant challenge for embodied AI.

Article Content

From source RSS / original summary

arXiv:2605.23187v1 Announce Type: new Abstract: Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached.

We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories.
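As a rough illustration of the episode layout described above, the sketch below keeps the target category on the evaluator side and exposes only the intent and scene to the agent. The class name, field names, and the `agent_view` helper are hypothetical, not taken from the IntentionNav release; RGB-D frames and pose would arrive per step from the simulator and are omitted here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode record; names are illustrative assumptions."""
    intent: str                                # free-text instruction shown to the agent
    scene_id: str                              # identifier of the simulated scene
    target_category: str = field(repr=False)   # evaluator-only, hidden from repr

    def agent_view(self) -> dict:
        # Only the fields the agent may see; the target name is withheld.
        return {"intent": self.intent, "scene_id": self.scene_id}


episode = Episode(
    intent="I need something to warm this food",
    scene_id="scene_000",          # made-up identifier
    target_category="microwave",
)
print(episode.agent_view())
```

Keeping the withheld label out of both `repr` and `agent_view` is one simple way to avoid leaking it into prompts or logs.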

Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
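The four reported rates can be read as a funnel from target inference down to precise stopping. The sketch below computes such a funnel from per-episode logs. The metric definitions are assumptions made for illustration (the abstract does not spell them out): identification is an exact category match, neighborhood reach means any trajectory point within 2 m of the target, terminal success means the agent declared it was done while within 2 m, and grounded success additionally requires a correct category and a final distance of at most 1 m. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class EpisodeResult:
    predicted_category: str
    target_category: str
    trajectory: List[Point]   # visited (x, y) positions, assumed non-empty
    target_xy: Point          # position of the target instance
    declared_done: bool       # whether the agent issued a stop decision


def _dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def summarize(results: Sequence[EpisodeResult],
              neighborhood_m: float = 2.0,
              grounded_m: float = 1.0) -> Dict[str, float]:
    """Fraction of episodes passing each (assumed) funnel stage."""
    if not results:
        raise ValueError("need at least one episode")
    counts = {"identified": 0, "reached": 0, "terminal": 0, "grounded": 0}
    for r in results:
        identified = r.predicted_category == r.target_category
        reached = any(_dist(p, r.target_xy) <= neighborhood_m
                      for p in r.trajectory)
        final_dist = _dist(r.trajectory[-1], r.target_xy)
        terminal = r.declared_done and final_dist <= neighborhood_m
        grounded = terminal and identified and final_dist <= grounded_m
        counts["identified"] += identified
        counts["reached"] += reached
        counts["terminal"] += terminal
        counts["grounded"] += grounded
    return {k: v / len(results) for k, v in counts.items()}
```

Under these assumed definitions, a gap between the reach rate and the terminal rate points at the stop decision, while a gap between the terminal rate and the grounded rate points at target selection or final localization.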


