VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

arXiv cs.CV·Hao Zhang, Qinran Lin, Linqi Song, Yong Li

6/11/2026

Quick Answer

VL-DINO is an open-vocabulary object detector that extends DINO with CLIP's vision-language knowledge. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively.

Quick Take

Per the abstract, VL-DINO consistently outperforms prior approaches on LVIS in the zero-shot setting, with the gains attributed to three added modules: QPSC, VSE, and ORSA.

Key Points

VL-DINO employs a Query-guided Positive Sample Construction module for enhanced training.
The Visual Semantic Encoder distills CLIP visual knowledge into backbone features.
Object-Region Semantic Alignment aligns object-centric features with textual embeddings.
VL-DINO consistently outperforms prior methods on the LVIS benchmark in the zero-shot setting.
The model integrates both textual and visual CLIP knowledge into the DINO detection architecture.

Article Content

From source RSS / original summary

arXiv:2606.11546v1 Announce Type: new

Abstract: Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge.

Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training.
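The abstract does not say how QPSC builds its extra positives, so the sketch below does not attempt to reproduce that step. It only illustrates why additional positive pairs mean additional vision-language alignment supervision, under the assumption (not confirmed by the source) that alignment is trained with a sigmoid cross-entropy over query-text dot products. The names `alignment_loss`, `query_feats`, `text_embeds`, and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def alignment_loss(query_feats, text_embeds, positives):
    """Mean binary cross-entropy over all query-text dot-product logits.

    positives is a set of (query_index, category_index) pairs treated as
    matches; every other pair is treated as a negative. This loss form is
    an assumption for illustration, not the paper's stated objective.
    """
    total, count = 0.0, 0
    for qi, query in enumerate(query_feats):
        for ci, text in enumerate(text_embeds):
            logit = sum(a * b for a, b in zip(query, text))
            prob = min(max(sigmoid(logit), 1e-7), 1.0 - 1e-7)  # clamp for log
            target = 1.0 if (qi, ci) in positives else 0.0
            total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
            count += 1
    return total / count


# Toy data: three decoder queries, two category text embeddings.
query_feats = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.1]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

# Baseline matching labels queries 0 and 1 only.
base_positives = {(0, 0), (1, 1)}
# Hypothetical extra positive: query 2 also covers a category-0 object.
extra_positives = base_positives | {(2, 0)}

loss_base = alignment_loss(query_feats, text_embeds, base_positives)
loss_extra = alignment_loss(query_feats, text_embeds, extra_positives)
print(loss_base, loss_extra)
```

In the baseline labeling, query 2 is pushed away from category 0 even though its feature points toward it; labeling it as a positive turns that pair into a match the loss rewards. How QPSC decides which queries qualify is left to the paper.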

A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
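The two ideas in this paragraph, distilling a teacher's visual features into backbone features and matching region features to text embeddings, can be sketched generically. The abstract gives no formulas, so everything below is an assumption: a cosine-based distillation loss stands in for whatever VSE uses, and average pooling over a box plus cosine similarity stands in for ORSA's region extraction and alignment. All function names and toy values are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between paired feature vectors.

    student_feats stands in for backbone features, teacher_feats for
    CLIP visual features at the same locations (an assumed pairing).
    """
    pairs = list(zip(student_feats, teacher_feats))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)


def pool_region(feature_map, box):
    """Average the feature vectors inside box = (x0, y0, x1, y1).

    Coordinates are grid cells and the upper bounds are exclusive.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def best_category(region_feat, text_embeds):
    """Return (index of the most similar text embedding, all scores)."""
    scores = [cosine(region_feat, t) for t in text_embeds]
    return max(range(len(scores)), key=scores.__getitem__), scores


# Toy 2x2 feature map with 2-d features; the rows differ in content.
feature_map = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

top_region = pool_region(feature_map, (0, 0, 2, 1))
bottom_region = pool_region(feature_map, (0, 1, 2, 2))
top_label, _ = best_category(top_region, text_embeds)
bottom_label, _ = best_category(bottom_region, text_embeds)
print(top_label, bottom_label)
```

In a real detector these operations would run on tensors, with a learned projection between feature spaces and a proper region pooling operator; the list-based version here is only meant to make the data flow concrete.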
