Most invariance-based self-supervised methods pretrain on single object-centric images (e.g., ImageNet images) and learn representations that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter the semantics of the image. Furthermore, because the model learns geometrically insensitive features, it may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture that learns features sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to learn sensitive features by using targets that are themselves sensitive to those transformations, obtained by pooling and rotating the teacher feature map, and by predicting the applied rotation. Additionally, since training insensitively to multi-crop can capture long-term dependencies, we use a patch correspondence loss to train the model sensitively while still capturing long-term dependencies. When pretrained on non-object-centric images, our approach outperforms methods that learn geometric transformation-insensitive representations. We surpass the DINO \citep{caron2021emerging} baseline on image classification, semantic segmentation, detection, and instance segmentation, with improvements of 6.1 $Acc$, 3.3 $mIoU$, 3.4 $AP^b$, and 2.7 $AP^m$, respectively. Code and pretrained models are publicly available at:
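As a rough illustration of what a transformation-sensitive target could look like, the sketch below builds one from a teacher feature map by pooling over the student's crop region and then applying the same four-fold rotation the student's view received. It is not the authors' implementation: the function names (`rot90`, `crop_and_pool`, `sensitive_target`), the single-channel grid, the integer crop coordinates, and the use of plain average pooling are all assumptions made for brevity, since the abstract does not specify the pooling operator.

```python
# Illustrative sketch only; all names and simplifications are hypothetical.
# A "feature map" here is a single-channel 2D grid stored as a list of rows.


def rot90(grid, k):
    """Rotate a 2D grid counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return [list(row) for row in grid]


def crop_and_pool(grid, top, left, size, out_size):
    """Average-pool the square region starting at (top, left) to out_size x out_size."""
    if size % out_size != 0:
        raise ValueError("size must be a multiple of out_size in this sketch")
    step = size // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Collect the step x step cells that fall into output bin (i, j).
            cells = [
                grid[top + i * step + di][left + j * step + dj]
                for di in range(step)
                for dj in range(step)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def sensitive_target(teacher_map, crop, k, out_size):
    """Pool the teacher map over the student's crop, then rotate it like the student's view."""
    top, left, size = crop
    return rot90(crop_and_pool(teacher_map, top, left, size, out_size), k)


# Toy 4x4 teacher feature map with values 0..15.
teacher_map = [[4 * r + c for c in range(4)] for r in range(4)]

full_view = sensitive_target(teacher_map, crop=(0, 0, 4), k=0, out_size=2)
rotated_view = sensitive_target(teacher_map, crop=(0, 0, 4), k=1, out_size=2)
corner_crop = sensitive_target(teacher_map, crop=(2, 2, 2), k=0, out_size=2)
```

In this toy setting the three targets differ from one another, which is the point of a sensitive objective: changing the crop or the rotation changes what the student is asked to match, whereas an invariance objective would assign the same target to every view.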