IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction
Quick Take
IntentionNav is a benchmark for intent-driven object navigation that evaluates three vision-language models (VLMs) on 500 intents across 176 Isaac Sim scenes. The models identified the intended target in 48.3% of episodes but terminated successfully in only 24.9%, which shows how hard it is for embodied agents to act on indirect human instructions.
Key Points
- IntentionNav includes 500 intents and 64 target categories in 176 Isaac Sim scenes.
- Models achieved a 48.3% identification rate for intended targets.
- Only 24.9% of episodes ended in terminal success, and grounded 1 m success was 5.5%.
- Event-script intents had the highest success rate at 28.7%.
- Indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization.
Article Content
arXiv:2605.23187v1, Announce Type: new
Abstract: Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached.
We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories.
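As a rough illustration of the episode interface described above, the sketch below separates what the agent sees (intent text, RGB-D, pose) from the target label that only the evaluator keeps. All class names, field names, the `(x, y, yaw)` pose layout, and the pairing of the example intent with `microwave` are assumptions made for illustration; the benchmark's actual data format is not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    # Hypothetical layout: nested lists stand in for image arrays.
    rgb: List[List[Tuple[int, int, int]]]   # H x W RGB pixels
    depth_m: List[List[float]]              # H x W depth in meters
    pose: Tuple[float, float, float]        # assumed (x, y, yaw)


@dataclass(frozen=True)
class Episode:
    scene_id: str          # hypothetical scene identifier
    intent: str            # free-text implicit instruction
    target_category: str   # evaluator-only; never shown to the agent

    def agent_inputs(self, obs: Observation) -> dict:
        """Return only what the agent is allowed to see."""
        return {
            "intent": self.intent,
            "rgb": obs.rgb,
            "depth_m": obs.depth_m,
            "pose": obs.pose,
        }


episode = Episode(
    scene_id="scene_000",
    intent="I need something to warm this food",
    target_category="microwave",  # illustrative pairing, not from the dataset
)
obs = Observation(rgb=[[(0, 0, 0)]], depth_m=[[1.5]], pose=(0.0, 0.0, 0.0))
inputs = episode.agent_inputs(obs)
```

Keeping the target label out of `agent_inputs` mirrors the benchmark's premise: the agent has to infer the object from the intent rather than read its name.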
Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success.
We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.
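To make the gap between these rates concrete, here is a minimal sketch of how such staged metrics could be computed from per-episode logs. The abstract does not give exact metric definitions, so the record fields and the success criteria below (for example, treating terminal success as "agent declared done within 2 m" and grounded success as "declared done, target identified, within 1 m") are assumptions, and the four records are toy data, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    mode: str            # intent mode label, e.g. "event-script"
    identified: bool     # model named the intended target
    min_dist_m: float    # closest approach to the target during the run
    final_dist_m: float  # distance to the target when the agent stopped
    declared_done: bool  # agent claimed the goal was reached


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(results, near_m=2.0, grounded_m=1.0):
    # Each metric is stricter than simply getting near the target (assumed definitions).
    return {
        "identification": rate(r.identified for r in results),
        "neighborhood": rate(r.min_dist_m <= near_m for r in results),
        "terminal_success": rate(
            r.declared_done and r.final_dist_m <= near_m for r in results
        ),
        "grounded_success": rate(
            r.declared_done and r.identified and r.final_dist_m <= grounded_m
            for r in results
        ),
    }


def terminal_success_by_mode(results, near_m=2.0):
    """Break terminal success down by intent mode."""
    groups = defaultdict(list)
    for r in results:
        groups[r.mode].append(r.declared_done and r.final_dist_m <= near_m)
    return {mode: rate(flags) for mode, flags in groups.items()}


# Toy records for illustration only.
results = [
    EpisodeResult("event-script", True, 0.5, 0.8, True),
    EpisodeResult("event-script", True, 1.5, 3.0, True),
    EpisodeResult("affordance", False, 1.9, 1.9, False),
    EpisodeResult("affordance", False, 4.0, 4.0, True),
]
summary = summarize(results)
by_mode = terminal_success_by_mode(results)
```

Under these assumed definitions an agent can pass through the 2 m neighborhood and still fail, either by never declaring completion or by stopping after drifting away, which is one way a neighborhood rate can sit well above terminal success.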
