Vision-language models such as CLIP can provide rich semantic priors for open-vocabulary object detection. However, integrating both their textual and visual knowledge into a detection architecture remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO by exploiting CLIP's vision-language knowledge more effectively. First, a Query-guided Positive Sample Construction (QPSC) module constructs additional high-quality positive samples. This lets the vanilla DINO framework better accommodate mixed training across heterogeneous data sources and supplies more vision-language alignment signals, thereby incorporating richer textual knowledge during training. Second, a Visual Semantic Encoder (VSE) module distills CLIP visual knowledge into the backbone-extracted features, producing fused features for subsequent encoder refinement. Third, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features from the fused features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
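To make the region-text alignment idea behind ORSA more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the alignment objective, so the cosine-similarity logits, the softmax cross-entropy loss, the `temperature` value, and all function names here are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_logits(region_feats, text_embeds, temperature=0.07):
    """Cosine similarity between every region feature and every text
    embedding, divided by a temperature (hypothetical default)."""
    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    return [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]


def alignment_loss(region_feats, text_embeds, labels, temperature=0.07):
    """Mean softmax cross-entropy that pulls each region feature toward
    the text embedding of its category (assumed objective)."""
    if not labels:
        raise ValueError("labels must not be empty")
    logits = similarity_logits(region_feats, text_embeds, temperature)
    total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the category logits.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[label]
    return total / len(labels)


# Toy example: two region features and two category text embeddings.
region_feats = [[1.0, 0.0], [0.0, 2.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = alignment_loss(region_feats, text_embeds, labels=[0, 1])
loss_swapped = alignment_loss(region_feats, text_embeds, labels=[1, 0])
```

In this toy example, `loss_matched` is close to zero because each region points in the same direction as its labeled text embedding, while `loss_swapped` is large because the labels are deliberately mismatched. In a real detector the region features would come from the fused feature maps and the text embeddings from CLIP's text encoder.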