We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by 7.8 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation and achieves the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
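The two pretraining changes described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`resize_bilinear`, `cropped_positional_embedding`, `focal_contrastive_loss`), the crop-size sampling, the bilinear interpolation, the sigmoid-based pairwise form of the focal loss, and the default `temperature` and `gamma` values are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def resize_bilinear(grid, out_h, out_w):
    """Bilinearly resize an (H, W, D) grid to (out_h, out_w, D)."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, out_h)  # sample rows (corner-aligned)
    xs = np.linspace(0, w - 1, out_w)  # sample columns (corner-aligned)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def cropped_positional_embedding(pos_emb, rng):
    """Crop a random region of an (H, W, D) positional-embedding grid and
    resize it back to (H, W, D).

    Assumption: crop height and width are sampled uniformly and
    independently; the abstract does not state the sampling scheme.
    """
    h, w, _ = pos_emb.shape
    crop_h = rng.integers(2, h + 1)  # upper bound is exclusive
    crop_w = rng.integers(2, w + 1)
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    crop = pos_emb[top:top + crop_h, left:left + crop_w]
    return resize_bilinear(crop, h, w)


def focal_contrastive_loss(img, txt, temperature=0.1, gamma=2.0):
    """Focal loss over all image-text pairs in a batch of size B.

    Assumption: a sigmoid (pairwise binary) formulation, where matched
    pairs (the diagonal) are positives and all other pairs are negatives.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) scaled cosine similarities
    prob = 1.0 / (1.0 + np.exp(-logits))
    is_positive = np.eye(len(img), dtype=bool)
    # Probability assigned to the correct label of each pair.
    p_true = np.clip(np.where(is_positive, prob, 1.0 - prob), 1e-12, 1.0)
    # (1 - p_true)^gamma down-weights pairs that are already easy.
    loss = -((1.0 - p_true) ** gamma) * np.log(p_true)
    return loss.sum() / len(img)


rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(14, 14, 8))  # hypothetical 14x14 grid, dim 8
cpe = cropped_positional_embedding(pos_emb, rng)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
loss = focal_contrastive_loss(image_emb, text_emb)
```

In this sketch, setting `gamma=0` reduces the loss to plain binary cross entropy over all pairs, while larger values of `gamma` shift weight toward the pairs the model still gets wrong, which is the intent behind focusing on difficult examples.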