Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. The major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. OWLv2 already surpasses the performance of previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples). With OWL-ST, however, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
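The self-training step described above (an existing detector produces pseudo-box annotations on image-text pairs, which are then filtered) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the OWL-ST recipe itself: the `Detector` interface, the use of caption word n-grams as the label space, and the `min_score` threshold of 0.3 are all hypothetical placeholders. The last helper only reproduces the relative-improvement arithmetic quoted for LVIS rare classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Detection:
    box: Box
    query: str    # text query that the box was matched to
    score: float  # detector confidence in [0, 1]


# Hypothetical detector interface: (image, text queries) -> detections.
# Any open-vocabulary detector could be wrapped to fit this signature.
Detector = Callable[[object, Sequence[str]], List[Detection]]


def caption_ngrams(caption: str, max_n: int = 2) -> List[str]:
    """Build a simple label space from a caption: all word n-grams up to max_n.

    Assumption: this is one possible choice of label space, used here
    only to make the sketch concrete.
    """
    words = caption.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


def pseudo_annotate(
    detector: Detector,
    image: object,
    caption: str,
    min_score: float = 0.3,  # placeholder threshold, not a value from the text
) -> List[Detection]:
    """Run the detector on caption-derived queries and keep confident boxes."""
    queries = caption_ngrams(caption)
    detections = detector(image, queries)
    return [d for d in detections if d.score >= min_score]


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Arithmetic behind the quoted LVIS rare-class figure: 31.2% -> 44.6% AP.
lvis_rare_gain = relative_improvement(31.2, 44.6)  # about 0.43
```

In a full pipeline the filtered detections would be stored as training targets for the next detector. The filtering threshold trades pseudo-label precision against the number of retained examples, which is one reason pseudo-annotation filtering is listed among the main challenges.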