Self-Supervised Learning Based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation

Self-supervised learning (SSL) methods have achieved remarkable success in learning image representations that are invariant to image transformations, but this invariance discards transformation information that some computer vision tasks require. Recent approaches address this limitation by learning equivariant features using linear operators in feature space, but they impose restrictive assumptions that constrain flexibility and generalization. We introduce a weaker definition of the relation between transformations in image space and in feature space, which we call equivariance-coherence. We propose a novel SSL auxiliary task that learns equivariance-coherent representations through intermediate transformation reconstruction and can be integrated with existing joint-embedding SSL methods. Our key idea is to reconstruct images at intermediate points along transformation paths: for example, when training on 30-degree rotations, we reconstruct the 10-degree and 20-degree rotation states. Reconstructing intermediate states requires using the transformation information applied during augmentation rather than suppressing it, and therefore encourages features that retain this information. Our method decomposes each feature vector into an invariant part and an equivariant part, trained with standard SSL losses and reconstruction losses, respectively. We demonstrate substantial improvements on synthetic equivariance benchmarks while maintaining competitive performance on downstream tasks that require invariant representations. The approach integrates with existing SSL methods (iBOT, DINOv2) and consistently improves performance across diverse tasks, including segmentation, detection, depth estimation, and video dense prediction. Our framework provides a practical way to augment SSL methods with equivariant capabilities while preserving invariant performance.
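The intermediate-reconstruction idea and the invariant/equivariant split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The evenly spaced intermediate angles, the fixed split index, the loss weight, and all function names (`intermediate_angles`, `split_features`, `total_loss`) are assumptions made for illustration only; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple


def intermediate_angles(target_deg: float, num_steps: int) -> List[float]:
    """Return angles strictly between 0 and target_deg.

    Assumption: intermediate states are evenly spaced along the rotation path.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    step = target_deg / (num_steps + 1)
    return [step * i for i in range(1, num_steps + 1)]


def split_features(
    features: Sequence[float], num_invariant: int
) -> Tuple[List[float], List[float]]:
    """Split a feature vector into (invariant part, equivariant part).

    Assumption: the first num_invariant dimensions form the invariant part
    and the remaining dimensions form the equivariant part.
    """
    if not 0 <= num_invariant <= len(features):
        raise ValueError("num_invariant must lie within the feature length")
    return list(features[:num_invariant]), list(features[num_invariant:])


def total_loss(
    ssl_loss: float, recon_losses: Sequence[float], weight: float = 1.0
) -> float:
    """Combine an SSL loss (computed on the invariant part) with the mean
    reconstruction loss over intermediate states (computed from the
    equivariant part).

    Assumption: the two terms are combined as a weighted sum.
    """
    if not recon_losses:
        raise ValueError("recon_losses must not be empty")
    mean_recon = sum(recon_losses) / len(recon_losses)
    return ssl_loss + weight * mean_recon


if __name__ == "__main__":
    # Training on a 30-degree rotation with two intermediate targets.
    print(intermediate_angles(30, 2))
    invariant, equivariant = split_features([0.1, 0.2, 0.3, 0.4], 3)
    print(invariant, equivariant)
    print(total_loss(1.0, [0.5, 1.5], weight=2.0))
```

In a real training loop, a decoder would presumably take the equivariant part together with the intermediate angle and predict the rotated image, while the invariant part would be passed to the joint-embedding objective; those components are omitted here because the abstract does not describe them.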
