Most invariance-based self-supervised methods pretrain on single object-centric images (e.g., ImageNet images) and learn representations that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter the semantics of the image. Furthermore, as the model becomes insensitive to geometric transformations, it may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture designed to learn features that are sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to be sensitive to these transformations in two ways: it is trained on targets that are themselves sensitive to them, obtained by pooling and rotating the teacher feature map, and it is trained to predict the rotation. Additionally, because training a model to be insensitive to multi-crop encourages local-to-global correspondence, the model can capture long-range dependencies. Instead of enforcing correspondence between views of the image, we use a patch correspondence loss that encourages correspondence between patches with similar features. This allows us to capture long-range dependencies in a more appropriate way. When pretrained on non-object-centric images, our approach outperforms other methods that learn representations insensitive to geometric transformations. We surpass the DINO baseline in image classification, semantic segmentation, detection, and instance segmentation, with improvements of 4.9 Top-1 accuracy, 3.3 mIoU, 3.4 $AP^b$, and 2.7 $AP^m$, respectively. Code and pretrained models are publicly available at: https://github.com/bok3948/GTSA
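The two mechanisms named in the abstract, transformation-sensitive targets built from the teacher feature map and correspondence between patches with similar features, can be illustrated with a small NumPy sketch. This is not the released GTSA code: the function names `crop_and_rotate_target` and `match_patches`, the integer-aligned crop standing in for the pooling step, and the use of cosine similarity to pair patches are all assumptions made for illustration.

```python
import numpy as np


def crop_and_rotate_target(teacher_map, box, k):
    """Build a target that changes with the student's crop and rotation.

    teacher_map: array of shape (H, W, C), the teacher's patch feature grid.
    box: (top, left, height, width) of the student's crop, in grid cells.
         Assumption: the crop is aligned to the grid, so plain slicing
         stands in for the pooling step mentioned in the abstract.
    k: number of 90-degree counterclockwise turns applied to the student view.
       The same k could also serve as the label for rotation prediction.
    """
    top, left, height, width = box
    region = teacher_map[top:top + height, left:left + width, :]
    # Rotate only the two spatial axes; the channel axis is left untouched.
    return np.rot90(region, k=k, axes=(0, 1))


def match_patches(student_patches, teacher_patches):
    """For each student patch, return the index of the most similar teacher patch.

    student_patches: array of shape (N, C).
    teacher_patches: array of shape (M, C).
    Assumption: similarity is measured with cosine similarity.
    """
    s = student_patches / np.linalg.norm(student_patches, axis=1, keepdims=True)
    t = teacher_patches / np.linalg.norm(teacher_patches, axis=1, keepdims=True)
    similarity = s @ t.T  # shape (N, M)
    return np.argmax(similarity, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_map = rng.normal(size=(4, 4, 8))

    # A 2x2 crop starting at grid cell (1, 1), rotated by one quarter turn.
    target = crop_and_rotate_target(teacher_map, box=(1, 1, 2, 2), k=1)
    print(target.shape)

    # Pair each student patch with its most similar teacher patch.
    teacher_patches = teacher_map.reshape(-1, 8)
    student_patches = target.reshape(-1, 8)
    print(match_patches(student_patches, teacher_patches))
```

Because the target is cut and rotated from the teacher map, a student that regresses onto it cannot ignore where the crop was taken or how the view was rotated, which is the sense of "sensitive" used above. A loss built on `match_patches` would then compare each student patch with its matched teacher patch rather than with a whole second view.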