Most self-supervised learning (SSL) methods work on curated datasets where the object-centric assumption holds. This assumption breaks down in uncurated images. Existing SSL methods for scene images try to find two views of the original scene image that are well matched or densely corresponding, which is both complex and computationally heavy. This paper proposes a conceptually different pipeline: first find regions that are coarse objects (with adequate objectness), then crop them out as pseudo object-centric images, after which any SSL method can be applied directly, as on a real object-centric dataset. That is, coarse crops benefit scene image SSL. We propose a novel cropping strategy that produces coarse object boxes. The new pipeline and cropping strategy successfully learn quality features from uncurated datasets without ImageNet. Experiments show that our pipeline outperforms existing SSL methods (MoCo-v2, DenseCL and MAE) on classification, detection and segmentation tasks. We further conduct extensive ablations to verify that: 1) the pipeline does not rely on pretrained models; 2) the cropping strategy is better than existing object discovery methods; 3) our method is not sensitive to hyperparameters or data augmentations.
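The abstract does not say how objectness is scored or how boxes are selected, so the following is only an illustrative sketch of the crop-then-pretrain idea, not the paper's cropping strategy. It assumes a precomputed 2D objectness map (a hypothetical input), fixed-size square windows, and greedy overlap suppression; every function name and parameter here is hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes (x0, y0, x1, y1), with x1 and y1 exclusive."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def integral_image(score_map):
    """Summed-area table so any window sum costs four lookups."""
    h, w = len(score_map), len(score_map[0])
    integral = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += score_map[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    return integral


def coarse_object_boxes(score_map, box_size, num_boxes=2, stride=1, max_iou=0.3):
    """Assumed heuristic: keep the windows with the highest mean objectness."""
    h, w = len(score_map), len(score_map[0])
    integral = integral_image(score_map)

    candidates = []
    for y0 in range(0, h - box_size + 1, stride):
        for x0 in range(0, w - box_size + 1, stride):
            y1, x1 = y0 + box_size, x0 + box_size
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            candidates.append((total / box_size ** 2, (x0, y0, x1, y1)))

    # Greedy selection: best score first, skipping windows that overlap
    # an already kept box by more than max_iou.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for _, box in candidates:
        if all(box_iou(box, k) <= max_iou for k in kept):
            kept.append(box)
        if len(kept) == num_boxes:
            break
    return kept


def crop_pseudo_images(image, boxes):
    """Cut each box out of a row-major image (list of rows of pixels)."""
    return [[row[x0:x1] for row in image[y0:y1]] for (x0, y0, x1, y1) in boxes]
```

Under these assumptions, each returned crop would be treated as an ordinary object-centric image and passed to any SSL method with its usual augmentations; the scene-level image itself is no longer needed at pretraining time.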