Most invariance-based self-supervised methods pretrain on single-object-centric images (e.g., ImageNet images) and learn features that are invariant to geometric transformations. However, when images are not object-centric, cropping can significantly alter their semantics. Furthermore, as a model becomes insensitive to geometric transformations, it may struggle to capture location information. We therefore propose a Geometric Transformation Sensitive Architecture that is designed to be sensitive to geometric transformations, specifically four-fold rotation, random crop, and multi-crop. Our method encourages the student to be sensitive by having it predict rotation and by using targets that vary with those transformations, obtained by pooling and rotating the teacher feature map. Additionally, we use a patch correspondence loss to encourage correspondence between patches with similar features. This captures long-range dependencies more appropriately than encouraging local-to-global correspondence, which is what happens when a model learns to be insensitive to multi-crop. When non-object-centric images are used as pretraining data, our approach outperforms methods that train the model to be insensitive to geometric transformations. We surpass the DINO [\citet{caron2021emerging}] baseline on image classification, semantic segmentation, detection, and instance segmentation, with improvements of 4.9 Top-1 accuracy, 3.3 mIoU, 3.4 $AP^b$, and 2.7 $AP^m$, respectively. Code and pretrained models are publicly available at: \url{https://github.com/bok3948/GTSA}
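To make the idea of transformation-dependent targets concrete, the following is a minimal NumPy sketch of how a target might be built by pooling and rotating a teacher feature map. It is an illustration under stated assumptions, not the released implementation: the function names `crop_and_pool` and `make_target`, the `(top, left, height, width)` box format, the use of average pooling, and the counterclockwise rotation convention are all hypothetical choices made for this example.

```python
import numpy as np


def crop_and_pool(feat, box, out_size):
    """Average-pool the region `box` of an (H, W, C) feature map
    down to an (out_size, out_size, C) grid.

    `box` is (top, left, height, width) in feature-map cells
    (an assumed format, chosen for this sketch).
    """
    top, left, height, width = box
    if height < out_size or width < out_size:
        raise ValueError("box must span at least out_size cells per side")
    region = feat[top:top + height, left:left + width]
    # Integer bin edges that split the region into out_size nearly equal bins.
    row_edges = (np.arange(out_size + 1) * height) // out_size
    col_edges = (np.arange(out_size + 1) * width) // out_size
    pooled = np.empty((out_size, out_size, feat.shape[-1]), dtype=float)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.mean(axis=(0, 1))
    return pooled


def make_target(teacher_feat, box, k, out_size):
    """Build a target that changes with the student's crop and rotation.

    teacher_feat: (H, W, C) teacher feature map of the uncropped, unrotated view.
    box:          crop seen by the student, in feature-map cells.
    k:            number of 90-degree turns applied to the student's view (0..3).
                  Assumes the student's image was rotated counterclockwise,
                  which is the direction np.rot90 uses.
    Returns the pooled-and-rotated feature target and the rotation label.
    """
    pooled = crop_and_pool(teacher_feat, box, out_size)
    rotated = np.rot90(pooled, k=k % 4, axes=(0, 1))
    return rotated, k % 4


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feat = rng.normal(size=(14, 14, 8))  # toy 14x14 grid, 8 channels
    target, rot_label = make_target(teacher_feat, box=(2, 3, 8, 8), k=1, out_size=4)
    print(target.shape, rot_label)
```

Because the target is cropped and rotated in the same way as the student's input, a student that matches it cannot discard crop position or orientation, unlike an invariance objective where every view shares a single target.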