Inspired by the outstanding zero-shot capability of vision-language models (VLMs) in image classification, open-vocabulary object detection has attracted increasing interest by distilling broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels only (e.g., bicycle), disregarding the capability of VLMs to align visual embeddings with fine-grained text descriptions of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open-Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors, which enable precise region-text alignment as well as general open-vocabulary detection training. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open-vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository, which enables iterative mining and refinement of visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet consistently outperforms the state of the art by large margins.
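To make the contrast between label-only alignment and descriptor-enhanced alignment concrete, the following is a minimal toy sketch, not DVDet's actual formulation. It assumes that a region embedding, a category-label embedding, and a few part-descriptor embeddings already live in a shared space, and it blends label similarity with the mean descriptor similarity. The function names, the `weight` mixing coefficient, and all vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def descriptor_score(region, label_emb, descriptor_embs, weight=0.5):
    """Blend label similarity with mean descriptor similarity.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    With weight=0.0 this reduces to label-only alignment.
    """
    label_sim = cosine(region, label_emb)
    if not descriptor_embs:
        return label_sim
    desc_sim = sum(cosine(region, d) for d in descriptor_embs) / len(descriptor_embs)
    return (1 - weight) * label_sim + weight * desc_sim


def classify(region, categories, weight=0.5):
    """Return the category name with the highest blended score."""
    return max(
        categories,
        key=lambda name: descriptor_score(
            region,
            categories[name]["label"],
            categories[name]["descriptors"],
            weight,
        ),
    )


# Toy 3-D embeddings (made up for illustration, not produced by any real VLM).
region = [0.9, 0.1, 0.4]
categories = {
    "bicycle": {
        "label": [1.0, 0.0, 0.0],
        "descriptors": [[0.8, 0.0, 0.6], [0.7, 0.1, 0.7]],  # e.g. "pedals", "bell"
    },
    "motorcycle": {
        "label": [1.0, 0.1, 0.0],
        "descriptors": [[0.6, 0.8, 0.0], [0.5, 0.9, 0.1]],  # e.g. "engine", "exhaust"
    },
}

print(classify(region, categories, weight=0.0))  # label-only alignment
print(classify(region, categories, weight=0.5))  # label + descriptors
```

With these toy vectors, the two label embeddings are nearly indistinguishable to the region, so label-only scoring picks "motorcycle", while adding descriptor similarity shifts the decision to "bicycle". This only illustrates why part-level descriptions can disambiguate similar categories; how DVDet obtains, weights, and refines its descriptors is not specified in the abstract.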