Most invariance-based self-supervised methods pretrain on single object-centric images (e.g., ImageNet images) and learn representations that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter the semantics of the image. Furthermore, as the model becomes insensitive to geometric transformations, it may struggle to capture location information. For this reason, we propose a Geometric Transformation Sensitive Architecture designed to learn features that are sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to be sensitive by using targets that are themselves sensitive to those transformations: the teacher feature map is pooled and rotated to match the student's view, and the model additionally predicts the applied rotation. Training to be insensitive to multi-crop encourages local-to-global correspondence and thereby helps the model capture long-term dependencies. Instead of enforcing correspondence between views of the image, we use a patch correspondence loss that encourages correspondence between patches with similar features, which captures long-term dependencies in a more appropriate way. Our approach demonstrates improved performance when using non-object-centric images as pretraining data compared to other methods that learn geometric transformation-insensitive representations. We surpass the DINO baseline in image classification, semantic segmentation, detection, and instance segmentation, with improvements of 4.9 Top-1 accuracy, 3.3 mIoU, 3.4 $AP^b$, and 2.7 $AP^m$, respectively. Code and pretrained models are publicly available at: https://github.com/bok3948/GTSA
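To make the idea of a transformation-sensitive target concrete, the following is a minimal sketch, not the authors' implementation (which is in the linked repository). It assumes a teacher feature map stored as a 2D grid with one scalar per cell (real feature maps carry a vector per cell), a square crop whose side is divisible by the output grid size, and counter-clockwise rotation in multiples of 90 degrees. The function names `rot90`, `crop_and_pool`, and `sensitive_target` are hypothetical and chosen only for illustration.

```python
from typing import List

Grid = List[List[float]]


def rot90(grid: Grid, k: int) -> Grid:
    """Rotate a 2D grid counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def crop_and_pool(grid: Grid, top: int, left: int, size: int, out: int) -> Grid:
    """Crop a size x size window at (top, left) and average-pool it to out x out."""
    if size % out != 0:
        raise ValueError("size must be divisible by out in this sketch")
    step = size // out
    pooled = []
    for i in range(out):
        row = []
        for j in range(out):
            # Collect the step x step block of teacher cells behind output cell (i, j).
            cells = [
                grid[top + i * step + di][left + j * step + dj]
                for di in range(step)
                for dj in range(step)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def sensitive_target(teacher_map: Grid, top: int, left: int,
                     size: int, out: int, k: int) -> Grid:
    """Build a target that follows the student's crop and rotation (assumed setup)."""
    return rot90(crop_and_pool(teacher_map, top, left, size, out), k)


# Toy 4 x 4 teacher map with values 0..15, row-major.
teacher = [[float(4 * r + c) for c in range(4)] for r in range(4)]
target = sensitive_target(teacher, top=0, left=0, size=4, out=2, k=1)
```

Because the target is cropped and rotated in the same way as the student's input, a student that matches it cannot ignore where its crop was taken or how it was rotated, which is the sense in which the target is sensitive to those transformations.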