We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by 7.8 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.
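The two pretraining changes described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything the abstract does not state is an assumption made for illustration. That includes the upsampled grid size `up_size`, the square crop with area fraction sampled from `[min_scale, 1]`, the `temperature` value, and the specific focal form $-(1 - p_i)^\gamma \log p_i$ applied to the softmax probability of the matching pair. The function names are hypothetical.

```python
import numpy as np


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an (H, W, D) array to (out_h, out_w, D)."""
    in_h, in_w, _ = grid.shape
    # Output sample positions expressed in input coordinates (corners aligned).
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def cropped_positional_embedding(pos_emb, rng, up_size=64, min_scale=0.1):
    """Upsample an (H, W, D) positional embedding, crop a random region,
    and resize the crop back to (H, W, D).

    Assumptions (not from the abstract): up_size, min_scale, square crops.
    """
    h, w, _ = pos_emb.shape
    upsampled = bilinear_resize(pos_emb, up_size, up_size)
    # Crop side chosen so the crop area is a random fraction of the grid.
    side = int(round(up_size * np.sqrt(rng.uniform(min_scale, 1.0))))
    side = max(side, 2)
    top = rng.integers(0, up_size - side + 1)
    left = rng.integers(0, up_size - side + 1)
    crop = upsampled[top:top + side, left:left + side]
    return bilinear_resize(crop, h, w)


def log_softmax(x):
    """Row-wise log-softmax, shifted for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def focal_contrastive_loss(img_emb, txt_emb, temperature=0.01, gamma=1.0):
    """Symmetric image-text contrastive loss with a focal weight.

    Assumed form: -(1 - p_i)^gamma * log(p_i), where p_i is the softmax
    probability of the matching pair. gamma = 0 gives softmax cross entropy.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def one_direction(lg):
        log_p = np.diag(log_softmax(lg))  # matching pairs lie on the diagonal
        p = np.exp(log_p)
        return -np.mean((1.0 - p) ** gamma * log_p)

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (one_direction(logits) + one_direction(logits.T))
```

In a pretraining step under these assumptions, the output of `cropped_positional_embedding` would be added to the patch embeddings in place of the full positional embedding. Setting `gamma=0` recovers the usual softmax cross-entropy baseline, which makes the effect of the focal weighting easy to ablate.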