We tackle the challenging task of unsupervised object localization. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. We present Multiple Object localization with Self-supervised Transformers (MOST), which uses features of transformers trained with self-supervised learning to localize multiple objects in real-world images. MOST analyzes the similarity maps of the features using box counting, a fractal analysis tool, to identify tokens lying on foreground patches. The identified tokens are then clustered, and the tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image, and it outperforms state-of-the-art algorithms on several object localization and discovery benchmarks on the PASCAL-VOC 07, PASCAL-VOC 12, and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pre-training of object detectors, and that it yields consistent improvements on fully supervised and semi-supervised object detection and on unsupervised region proposal generation.
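To make the box-counting idea concrete, the following is a minimal sketch of generic box counting on a binary 2D map, assuming the similarity map has already been thresholded into a boolean mask. The function names `box_count` and `box_counting_dimension`, the box sizes, and the thresholding assumption are illustrative choices and are not taken from the paper; the exact way MOST applies box counting to its similarity maps may differ.

```python
import numpy as np


def box_count(mask: np.ndarray, box_size: int) -> int:
    """Count boxes of side `box_size` that contain at least one True cell."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    # Pad with False so both sides become multiples of box_size.
    pad_h = (-h) % box_size
    pad_w = (-w) % box_size
    padded = np.pad(mask, ((0, pad_h), (0, pad_w)), constant_values=False)
    rows, cols = padded.shape
    # Split the map into a grid of (box_size x box_size) blocks.
    blocks = padded.reshape(rows // box_size, box_size, cols // box_size, box_size)
    # A block is "occupied" if any cell inside it is True.
    return int(blocks.any(axis=(1, 3)).sum())


def box_counting_dimension(mask: np.ndarray, box_sizes=(1, 2, 4, 8)) -> float:
    """Estimate the box-counting dimension as the negated slope of
    log(occupied boxes) against log(box size)."""
    counts = np.array([box_count(mask, s) for s in box_sizes], dtype=float)
    if np.any(counts == 0):
        # An empty mask has no occupied boxes; report 0.0 by convention here.
        return 0.0
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return float(-slope)


if __name__ == "__main__":
    # Hypothetical 16x16 token grid: a filled map versus a single thin row.
    filled = np.ones((16, 16), dtype=bool)
    thin = np.zeros((16, 16), dtype=bool)
    thin[5, :] = True
    print(box_counting_dimension(filled))
    print(box_counting_dimension(thin))
```

In this sketch a map that fills the grid yields a dimension near 2 and a thin line yields a dimension near 1, which illustrates how box counting summarizes the spatial structure of a map across scales. How such a statistic is turned into a foreground decision for each token is specific to the method and is not reproduced here.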