Training image-based object detectors is challenging: it involves not only detecting objects but also localizing them precisely in diverse and potentially noisy environments. Collecting imagery, however, is often straightforward; for instance, vehicle-mounted cameras can capture large amounts of data across a wide range of real-world scenarios. We therefore introduce a method for training single-stage object detectors through unsupervised/self-supervised learning. The approach has the potential to substantially reduce the time and cost of manual annotation, and it opens research opportunities on large, diverse, and challenging datasets that lack extensive labels. Whereas most unsupervised learning methods target classification, our approach addresses object detection. We introduce intra-image contrastive learning alongside its inter-image counterpart, which enables the model to acquire the location information that object detection requires. The method learns and represents this location information, yielding informative heatmaps. Our results show an accuracy of \textbf{89.2\%}, an improvement of approximately \textbf{15x} over random initialization for unsupervised object detection.