Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. The major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors even at comparable training scales (~10M examples). With OWL-ST, however, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
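The self-training step described above (an existing detector produces pseudo-box annotations on image-text pairs, which are then filtered) can be illustrated with a minimal sketch. This is not the OWL-ST implementation: the `Detection` type, the `Detector` callable signature, the use of caption word n-grams as the label space, and the `min_score` threshold of 0.3 are all assumptions made for illustration. The last helper simply reproduces the relative-improvement arithmetic quoted in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one predicted box."""
    box: Box
    label: str
    score: float


# Hypothetical detector interface: (image, text queries) -> scored detections.
Detector = Callable[[object, Sequence[str]], List[Detection]]


def caption_ngrams(caption: str, max_n: int = 2) -> List[str]:
    """Build a label space from the caption: all word n-grams up to max_n.

    Assumption for illustration; the abstract only says that the choice
    of label space is one of the challenges.
    """
    words = caption.lower().split()
    queries: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            query = " ".join(words[i:i + n])
            if query not in queries:  # keep first occurrence, preserve order
                queries.append(query)
    return queries


def pseudo_annotate(image: object, caption: str, detector: Detector,
                    min_score: float = 0.3, max_n: int = 2) -> List[Detection]:
    """Run the existing detector on one image-text pair and keep only
    detections whose score reaches min_score (assumed filtering rule)."""
    queries = caption_ngrams(caption, max_n)
    return [d for d in detector(image, queries) if d.score >= min_score]


def relative_improvement(before: float, after: float) -> float:
    """Relative change of a metric, e.g. AP on LVIS rare classes."""
    return (after - before) / before
```

Under these assumptions, `relative_improvement(31.2, 44.6)` evaluates to roughly 0.43, matching the 43% relative improvement reported for the L/14 architecture.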