QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval
Quick Answer
QueryGaussian introduces a training-free framework for scalable open-vocabulary 3D instance retrieval, achieving over 70% GPU memory reduction and 180x faster inference.
Quick Take
QueryGaussian interprets user prompts with pre-trained 2D vision models instead of distilling semantic features into every 3D primitive. This enables instance retrieval in city-scale scenes containing tens of millions of Gaussians on consumer-grade hardware.
Key Points
- Reduces GPU memory usage by over 70% compared to existing methods while matching state-of-the-art accuracy.
- Accelerates inference by 180x.
- Uses pre-trained 2D vision models to interpret user prompts, with no additional training.
- Decouples semantic understanding from geometric representation through an instance-level query mechanism.
- Enables retrieval in city-scale scenes containing tens of millions of Gaussians on consumer-grade hardware.
Article Content
From source RSS / original summary
arXiv:2606.19733v1 Announce Type: new
Abstract: Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive.
This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval.
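Editorial illustration (not from the paper): the linear scaling described above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, a 512-dimensional float16 feature stored on every primitive. The function name, the feature dimension, and the precision are hypothetical and are not taken from QueryGaussian or any specific baseline.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def per_primitive_feature_bytes(num_primitives: int,
                                feature_dim: int,
                                bytes_per_value: int) -> int:
    """Bytes needed to store one dense feature vector on every 3D primitive."""
    return num_primitives * feature_dim * bytes_per_value


# Hypothetical settings: 512-dimensional float16 features (2 bytes per value).
FEATURE_DIM = 512
BYTES_PER_VALUE = 2

for n in (1_000_000, 10_000_000, 50_000_000):
    gib = per_primitive_feature_bytes(n, FEATURE_DIM, BYTES_PER_VALUE) / GIB
    print(f"{n:>11,d} primitives -> {gib:6.2f} GiB of features")
```

Under these assumed settings, 10 million primitives already need roughly 9.5 GiB for features alone, before any geometry or rendering buffers, which is why a per-primitive embedding scheme can run out of memory as scenes grow.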
Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering.
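Editorial illustration (not from the paper): the abstract does not spell out how the maximum-weight association works, so the following is only one plausible reading. It assumes that each pixel records which Gaussians contributed to it and with what blending weight, that a 2D model has assigned each pixel a mask label (0 for background), and that each Gaussian takes the label that accumulated the most weight. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def lift_masks_max_weight(pixel_contribs, pixel_labels):
    """Assign each Gaussian the 2D mask label with the largest summed weight.

    pixel_contribs: one list per pixel of (gaussian_id, blend_weight) pairs.
    pixel_labels:   one mask label per pixel (0 = background).
    Returns a dict mapping gaussian_id -> winning label.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for contribs, label in zip(pixel_contribs, pixel_labels):
        for gaussian_id, weight in contribs:
            votes[gaussian_id][label] += weight
    # Keep, for every Gaussian, the label that collected the most weight.
    return {g: max(by_label, key=by_label.get) for g, by_label in votes.items()}


# Toy view with three pixels and three Gaussians (made-up numbers).
pixel_contribs = [
    [(0, 0.7), (1, 0.3)],  # pixel 0, inside mask 1
    [(0, 0.6), (2, 0.4)],  # pixel 1, inside mask 1
    [(1, 0.2), (2, 0.8)],  # pixel 2, background
]
pixel_labels = [1, 1, 0]

assigned = lift_masks_max_weight(pixel_contribs, pixel_labels)
print(assigned)
```

In this toy view, Gaussians 0 and 1 are assigned to mask 1, while Gaussian 2 stays in the background because its background weight (0.8) exceeds its in-mask weight (0.4). Votes from several views could be summed into the same table before taking the maximum. The temporal fusion and density clustering steps mentioned in the abstract are not modeled here.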
Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.


