Learning from pseudo-labels generated by VLMs~(Vision-Language Models) has been shown to be a promising way to assist open-vocabulary detection (OVD) in recent studies. However, due to the domain gap between VLMs and vision-detection tasks, pseudo-labels produced by VLMs are prone to be noisy, and the detector's training design further amplifies the bias. In this work, we investigate the root cause of VLMs' biased predictions in the OVD context. Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capabilities of the detector with the vision-language model. Our key insight is that the detector itself can act as a strong auxiliary guide to compensate for the VLM's inability to understand both the ``background'' and the context of a proposal within the image. Based on this insight, we substantially purify the noisy pseudo-labels via Online Mining and propose Adaptive Reweighting to effectively suppress biased training boxes that are not well aligned with the target object. In addition, we identify a neglected ``base-novel conflict'' problem and introduce stratified label assignment to prevent it. Extensive experiments on the COCO and LVIS datasets demonstrate that our method outperforms other state-of-the-art methods by significant margins. Code is available at https://github.com/wkfdb/MarvelOVD