Self-Supervised Learning for Real-World Object Detection: a Survey

Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN- and ViT-based architectures. Our benchmark is run on the widely used COCO dataset and on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, showing that certain SSL strategies are better suited to uncurated data. Our findings indicate that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited to ViT-based architectures and to pre-training on custom datasets. This survey provides a practical guide for selecting an SSL strategy, taking into account backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.
