In this paper, we are the first to explore multi-modal contextual knowledge as a means of understanding novel categories in open-vocabulary object detection (OVD). Multi-modal contextual knowledge refers to the joint relationships between image regions and words. Incorporating this knowledge into OVD is challenging, however, because previous detection frameworks cannot jointly model it: object detectors accept only visual inputs, and no captions are provided at test time. To this end, we propose MMC-Det, a multi-modal contextual knowledge distillation framework that transfers contextual knowledge learned by a teacher fusion transformer, trained with diverse multi-modal masked language modeling (D-MLM), to a student detector. D-MLM adds an object divergence constraint to traditional multi-modal masked language modeling (MLM) in order to extract the fine-grained region-level visual contexts that are vital to object detection. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy, with our approach outperforming recent state-of-the-art methods.
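To make the two training signals named above more concrete, the following is a minimal, purely illustrative sketch. The abstract does not give the actual loss formulations, so everything here is an assumption: the divergence constraint is approximated by mean pairwise cosine similarity between object tokens, the distillation term by a mean absolute difference between student and teacher embeddings, and the names `divergence_penalty`, `distillation_loss`, `teacher_objective`, and the `weight` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def divergence_penalty(object_tokens):
    """Assumed stand-in for an object divergence constraint.

    Returns the mean pairwise cosine similarity of the object tokens;
    minimizing it pushes the tokens to encode different regions.
    """
    n = len(object_tokens)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine(object_tokens[i], object_tokens[j]) for i, j in pairs)
    return total / len(pairs)


def teacher_objective(mlm_loss, object_tokens, weight=0.1):
    """Hypothetical teacher loss: an MLM loss plus a weighted divergence penalty."""
    return mlm_loss + weight * divergence_penalty(object_tokens)


def distillation_loss(student, teacher):
    """Assumed distillation term: mean absolute difference between
    student and teacher contextual embeddings (lists of vectors)."""
    total = sum(
        abs(s - t)
        for s_vec, t_vec in zip(student, teacher)
        for s, t in zip(s_vec, t_vec)
    )
    count = sum(len(s_vec) for s_vec in student)
    return total / count
```

In this sketch, identical object tokens incur the maximum penalty and orthogonal ones incur none, which is one simple way to encourage distinct region-level contexts; the student is then trained only to match the teacher's embeddings, so no caption is needed at test time.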