We present a new open-vocabulary detection framework that uses both image-level labels and detailed detection annotations when available. The framework proceeds in three steps. First, we train a language-conditioned object detector on fully supervised detection data. During training, this detector sees which ground-truth classes are present or absent in each image and conditions its predictions on the set of present classes. Second, we use this detector to pseudo-label images that have image-level labels. Thanks to its conditioning mechanism, the detector produces much more accurate pseudo-labels than prior approaches. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark as well as on direct zero-shot transfer benchmarks on LVIS, COCO, Objects365, and OpenImages. DECOLA outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark. It achieves state-of-the-art results across model sizes, architectures, and datasets while training only on open-source data with academic-scale compute. Code is available at https://github.com/janghyuncho/DECOLA.
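The second step, pseudo-labeling images from their image-level labels, can be illustrated with a toy sketch. It is not the DECOLA implementation: the actual detector conditions on the present classes inside the model, whereas this example only approximates that effect by discarding candidate boxes whose class is not among the image-level labels and keeping the highest-scoring box per present class. The names `Detection` and `pseudo_label` and the `min_score` threshold are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    label: str    # class name predicted for this box
    score: float  # detector confidence in [0, 1]


def pseudo_label(candidates, image_labels, min_score=0.3):
    """Keep the best-scoring candidate box for each image-level label.

    Assumption for illustration: restricting candidates to the classes
    known to be present stands in for the detector's conditioning.
    """
    present = set(image_labels)
    best = {}
    for det in candidates:
        # Ignore classes that the image-level labels say are absent,
        # and boxes the detector is not confident about.
        if det.label not in present or det.score < min_score:
            continue
        if det.label not in best or det.score > best[det.label].score:
            best[det.label] = det
    # Return boxes in the order the image-level labels were given.
    return [best[name] for name in image_labels if name in best]


candidates = [
    Detection((10, 10, 50, 50), "cat", 0.90),
    Detection((12, 8, 48, 52), "dog", 0.95),   # confident, but "dog" is not an image label
    Detection((60, 20, 90, 80), "vase", 0.40),
    Detection((0, 0, 5, 5), "vase", 0.10),     # below the score threshold
]
labels = pseudo_label(candidates, ["cat", "vase"])
```

In this example the confident but mislabeled "dog" box is dropped because "dog" is not among the image-level labels, which is the kind of error that knowledge of the present classes is meant to suppress. The resulting boxes would then serve as training targets for the unconditioned detector in the third step.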