The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and they struggle to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. First, a simple scheme augments the text embeddings, which prevents overfitting to the small number of classes seen during training while also saving memory and computation. Second, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourage vision-text feature alignment and guarantee it at the start of detection training. Finally, a self-training approach leverages a larger corpus of image-text pairs, improving detection performance on classes with no human-annotated bounding boxes. We evaluate the three methods on the zero-shot version of the LVIS benchmark, where each shows clear and significant benefits. Our final network achieves a new state of the art on the mAP-all metric and demonstrates competitive performance on mAP-rare, as well as superior transfer to COCO and Objects365.
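To illustrate how a trainable gated shortcut can guarantee vision-text alignment at the start of detection training, the following is a minimal sketch and not the implementation evaluated here. It assumes a hypothetical form y = x + tanh(g) · f(x) with a scalar gate g initialised to zero; the names `GatedShortcut` and `new_layer` and the tanh gating are illustrative assumptions. With g = 0 the newly added layer contributes nothing, so the pretrained feature passes through unchanged.

```python
import math
from typing import Callable, List

Vector = List[float]


class GatedShortcut:
    """Hypothetical gated residual wrapper: y = x + tanh(gate) * branch(x).

    The functional form and the zero initialisation are assumptions made
    for illustration only.
    """

    def __init__(self, branch: Callable[[Vector], Vector], gate: float = 0.0) -> None:
        self.branch = branch
        # In a real detector this would be a trainable scalar parameter.
        self.gate = gate

    def __call__(self, x: Vector) -> Vector:
        scale = math.tanh(self.gate)  # exactly 0.0 when gate == 0.0
        update = self.branch(x)
        return [xi + scale * ui for xi, ui in zip(x, update)]


def new_layer(x: Vector) -> Vector:
    """Stand-in for a freshly initialised, not yet trained detector layer."""
    return [2.0 * xi + 1.0 for xi in x]


pretrained_feature = [0.5, -1.0, 3.0]
layer = GatedShortcut(new_layer)

# At initialisation the shortcut is an identity map on the pretrained feature.
at_init = layer(pretrained_feature)

# Once training moves the gate away from zero, the new branch starts to contribute.
layer.gate = 0.5
after_update = layer(pretrained_feature)
```

Under these assumptions, `at_init` equals `pretrained_feature`, so any alignment the pretrained features have with the text embeddings is preserved when detection training begins. Training can then open the gate gradually rather than overwriting the pretrained representation.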