We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the softmax cross-entropy loss commonly used in contrastive learning with focal loss, to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.4 $AP_r$ on LVIS, surpassing the best existing approach by 6.1 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.
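The two pretraining changes named in the abstract, cropped positional embeddings and a focal contrastive loss, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the crop sampling scheme, the interpolation method, or the exact form of the focal loss, so the following are assumptions made for illustration: independent uniform sampling of crop height and width (`min_scale`), align-corners bilinear resizing, a sigmoid-based focal term over all image-text pairs in the batch, and the default values of `temperature` and `gamma`. All function names are hypothetical.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Resize an H x W grid of D-dim vectors to out_h x out_w (align-corners bilinear)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four neighbouring embedding vectors element-wise.
            row.append([
                (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
                for a, b, c, d in zip(grid[y0][x0], grid[y0][x1],
                                      grid[y1][x0], grid[y1][x1])
            ])
        out.append(row)
    return out


def cropped_positional_embedding(pos_emb, rng, min_scale=0.1):
    """Crop a random region of the positional-embedding grid and resize it back.

    The sampling scheme (independent uniform scales) is an assumption.
    """
    h, w = len(pos_emb), len(pos_emb[0])
    crop_h = max(1, round(h * rng.uniform(min_scale, 1.0)))
    crop_w = max(1, round(w * rng.uniform(min_scale, 1.0)))
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    region = [r[left:left + crop_w] for r in pos_emb[top:top + crop_h]]
    return bilinear_resize(region, h, w)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def focal_contrastive_loss(sim, temperature=0.1, gamma=1.0):
    """Sigmoid-based focal loss over a B x B image-text similarity matrix.

    Diagonal entries are matching pairs; off-diagonal entries are negatives.
    The sigmoid form and the default hyperparameters are assumptions.
    """
    b = len(sim)
    total = 0.0
    for i in range(b):
        for j in range(b):
            z = sim[i][j] / temperature
            # Log-probability assigned to the true label of pair (i, j).
            log_p = log_sigmoid(z if i == j else -z)
            p = math.exp(log_p)
            # Down-weight pairs that are already classified confidently.
            total += -((1.0 - p) ** gamma) * log_p
    return total / b
```

In this sketch, a pretraining step would call `cropped_positional_embedding` once per image on the model's full positional-embedding grid, add the result to the patch embeddings in place of the whole-image grid, and pass the batch similarity matrix to `focal_contrastive_loss`. Setting `gamma=0` reduces the loss to a plain sigmoid cross entropy, which makes the effect of the focal weighting easy to compare.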