A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to serve as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which the learned representation will capture. This assumption mostly holds for datasets such as ImageNet, where a large, centered object is highly likely to be present in random crops of the full image. However, in datasets such as OpenImages or COCO, which are more representative of real-world uncurated data, an image typically contains multiple small objects, so two random crops may not cover the same object. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets. We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the model to learn both object-level and scene-level semantic representations. This approach, which we call object-aware cropping, yields significant improvements over scene cropping on classification and object detection benchmarks. For example, on OpenImages, our approach achieves an improvement of 8.8% mAP over random scene-level cropping using MoCo-v2 based pre-training. We also show significant improvements on COCO and PASCAL-VOC object detection and segmentation tasks over state-of-the-art self-supervised learning approaches. Our approach is efficient, simple, and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.
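The cropping idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `scale` range, the fallback to random cropping when no proposals exist, and the uniform choice among proposals are all assumptions made for illustration. The `proposals` list stands in for the output of any object proposal algorithm, which the abstract does not specify.

```python
import random


def random_crop_box(width, height, rng, scale=(0.2, 1.0)):
    """Sample a scene-level crop box (x0, y0, x1, y1) inside the image.

    Assumption of this sketch: the crop keeps the image aspect ratio and
    covers a random fraction of the image area drawn from `scale`.
    """
    area_frac = rng.uniform(*scale)
    side = area_frac ** 0.5
    w = max(1, int(round(width * side)))
    h = max(1, int(round(height * side)))
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    return (x0, y0, x0 + w, y0 + h)


def object_aware_views(width, height, proposals, rng, both_object_crops=False):
    """Return two crop boxes to be used as a positive pair.

    `proposals` is a list of (x0, y0, x1, y1) boxes from some object
    proposal method (hypothetical input). One view is always an object
    crop; the other is either a second object crop or a random scene crop.
    """
    if not proposals:
        # Assumed fallback: plain random cropping when no proposals exist.
        return (random_crop_box(width, height, rng),
                random_crop_box(width, height, rng))
    first = rng.choice(proposals)
    if both_object_crops:
        second = rng.choice(proposals)
    else:
        second = random_crop_box(width, height, rng)
    return first, second


rng = random.Random(0)
proposals = [(10, 20, 110, 140), (200, 50, 300, 180)]
object_view, scene_view = object_aware_views(640, 480, proposals, rng)
```

In a full pipeline, each box would then be cropped from the image, resized, and passed through the usual augmentations before entering a framework such as MoCo-v2. Pairing an object crop with a scene crop is one way to read the abstract's claim that the model is encouraged to learn both object-level and scene-level representations.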