PInVerify: An Offline Embodied Benchmark for Active Instance Verification
Quick Take
PInVerify is an offline benchmark for Active Instance Verification (AIV), with 3,000 evaluation episodes across 18 object categories. Among baselines built on open-source multimodal large language models (MLLMs) of at most 8B parameters, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 percentage points, and a LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. The tested next-best-view strategies showed no reliable gains from active viewpoint selection.
Key Points
- PInVerify features 3,000 evaluation episodes across 18 object categories.
- Episodes are delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors.
- A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%.
- The best MLLM-based baseline exceeds the best embedding baseline by 4.9 percentage points.
- Ablations with ground-truth boxes show a +3.1 percentage point detection gap.
- No reliable gains were observed from active viewpoint selection within the tested next-best-view (NBV) strategies.
Article Content
arXiv:2605.30639v1. Abstract: Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection.
We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description.
We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors.
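The episode structure described above can be pictured with a small sketch. This is illustrative only and not taken from the PInVerify code: the names `SectorKind`, `Episode`, `reachable_views`, and `informative_count` are hypothetical. It also assumes the agent may move to any reachable sector within a fixed step budget, whereas the benchmark's actual topology may restrict moves.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SectorKind(Enum):
    INFORMATIVE = "informative"   # navigable, and the view helps verification
    TRAP = "trap"                 # navigable but uninformative
    UNREACHABLE = "unreachable"   # the agent cannot move here


@dataclass
class Episode:
    # Hypothetical container; the real benchmark schema may differ.
    description: str              # fine-grained natural-language description
    sectors: List[SectorKind]     # one entry per sector around the candidate
    is_match: bool                # ground-truth verification label


def reachable_views(episode: Episode, horizon: int) -> List[int]:
    """Placeholder policy: visit reachable sectors in index order,
    stopping once the finite horizon (step budget) is exhausted."""
    order = [i for i, kind in enumerate(episode.sectors)
             if kind is not SectorKind.UNREACHABLE]
    return order[:horizon]


def informative_count(episode: Episode, visited: List[int]) -> int:
    """Count visited sectors that are informative rather than trap views."""
    return sum(episode.sectors[i] is SectorKind.INFORMATIVE for i in visited)


episode = Episode(
    description="white floral cushion",
    sectors=[
        SectorKind.INFORMATIVE, SectorKind.TRAP, SectorKind.UNREACHABLE,
        SectorKind.INFORMATIVE, SectorKind.TRAP, SectorKind.UNREACHABLE,
    ],
    is_match=True,
)
visited = reachable_views(episode, horizon=3)
```

With a budget of three steps, this naive policy spends one step on a trap view, which is the kind of cost a viewpoint-selection strategy would try to avoid.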
As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale (≤8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies.
In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.
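The abstract names attribute decomposition and a visibility-weighted multi-view tracker without giving details. One plausible reading, sketched below, splits the description into attributes, scores each attribute per view, and averages the scores weighted by how visible the object is in that view. The functions `aggregate` and `verify`, the score ranges, and the 0.5 threshold are assumptions for illustration, not the authors' method.

```python
from typing import Dict, List, Tuple

# One observation per view: a visibility weight in [0, 1] and a
# per-attribute match score in [0, 1]. Both ranges are assumptions.
Observation = Tuple[float, Dict[str, float]]


def aggregate(observations: List[Observation]) -> Dict[str, float]:
    """Visibility-weighted mean of per-attribute scores across views."""
    totals: Dict[str, float] = {}
    weights: Dict[str, float] = {}
    for visibility, scores in observations:
        for attr, score in scores.items():
            totals[attr] = totals.get(attr, 0.0) + visibility * score
            weights[attr] = weights.get(attr, 0.0) + visibility
    # Attributes seen only with zero visibility carry no evidence and are dropped.
    return {a: totals[a] / weights[a] for a in totals if weights[a] > 0}


def verify(observations: List[Observation], attributes: List[str],
           threshold: float = 0.5) -> bool:
    """Accept the candidate only if every attribute clears the threshold."""
    agg = aggregate(observations)
    return all(agg.get(a, 0.0) >= threshold for a in attributes)


# "white floral" decomposed into two attributes, observed from two views.
views = [
    (1.0, {"white": 0.9, "floral": 0.2}),   # clear view, pattern looks wrong
    (0.5, {"white": 0.8, "floral": 0.8}),   # partly occluded view
]
scores = aggregate(views)
decision = verify(views, ["white", "floral"])
```

Here the clear view dominates the occluded one, so the weak "floral" evidence leads to rejection even though the second view alone would pass.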
