VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection
Quick Answer
VL-DINO is an open-vocabulary object detector that enhances DINO with CLIP's vision-language knowledge. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively.
Quick Take
Built on DINO, VL-DINO adds three modules (QPSC, VSE, and ORSA) that bring CLIP's textual and visual knowledge into the detector. It consistently outperforms prior approaches on the LVIS benchmark in the zero-shot setting.
Key Points
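- A Query-guided Positive Sample Construction (QPSC) module constructs additional high-quality positive samples, helping DINO accommodate mixed training across heterogeneous data sources.
- A Visual Semantic Encoder (VSE) distills CLIP visual knowledge into backbone-extracted features.
- An Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings.
- In the zero-shot setting, VL-DINO consistently outperforms prior approaches on the LVIS benchmark.
- QPSC and ORSA contribute textual knowledge, while VSE contributes CLIP's visual knowledge.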
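The distillation idea in the key points can be illustrated with a generic feature-distillation loss. This is a minimal sketch under stated assumptions, not the paper's method: the summary does not describe the VSE's architecture or its loss, and the names `l2_normalize` and `cosine_distillation_loss` are hypothetical. The sketch measures the mean cosine distance between student features (standing in for backbone features) and teacher features (standing in for CLIP visual features).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired student/teacher vectors.

    Hypothetical illustration: the actual VSE objective is not given
    in the summary, so this stands in for a generic distillation loss.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of vectors")
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_n, t_n))
    return total / len(student_feats)
```

The loss is 0 when each student vector points in the same direction as its teacher vector and grows as the directions diverge, so minimizing it pulls the student features toward the teacher's feature space regardless of scale.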
Article Content
From source RSS / original summary
arXiv:2606.11546v1
Abstract: Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge.
Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training.
A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
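Aligning region features with textual embeddings, as the abstract describes for ORSA, is commonly done in open-vocabulary detection by scoring each region against each category's text embedding. The sketch below assumes that generic CLIP-style recipe (cosine similarity, temperature-scaled softmax, cross-entropy); the abstract does not state ORSA's actual formulation, and the function names and the `temperature=0.07` default are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def classify_regions(region_feats, text_embeds, temperature=0.07):
    """For each region, return (best category index, softmax probabilities)."""
    results = []
    for region in region_feats:
        logits = [cosine(region, text) / temperature for text in text_embeds]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        denom = sum(exps)
        probs = [e / denom for e in exps]
        results.append((probs.index(max(probs)), probs))
    return results


def alignment_loss(region_feats, text_embeds, labels, temperature=0.07):
    """Mean cross-entropy between region-text scores and category labels."""
    preds = classify_regions(region_feats, text_embeds, temperature)
    return -sum(math.log(probs[y]) for (_, probs), y in zip(preds, labels)) / len(labels)
```

Because categories enter only through their text embeddings, a scheme like this can score categories that never appeared in detection training by embedding their names, which is the usual basis for zero-shot evaluation.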