Augmentation-based methods have shown remarkable success in self-supervised visual representation learning, excelling at learning invariant features but often neglecting equivariant ones. This limitation reduces the generalizability of foundation models, particularly for downstream tasks requiring equivariance. We propose integrating an image reconstruction task as an auxiliary component in augmentation-based self-supervised learning algorithms to facilitate equivariant feature learning without additional parameters. Our method implements a cross-attention mechanism to blend features learned from two augmented views, subsequently reconstructing one of them. This approach is adaptable to various datasets and augmented-pair-based learning methods. We evaluate its effectiveness at learning equivariant features through multiple linear regression tasks and downstream applications on both artificial (3DIEBench) and natural (ImageNet) datasets. Results consistently demonstrate significant improvements over standard augmentation-based self-supervised learning methods and state-of-the-art approaches, particularly in scenarios involving combined augmentations. Our method enhances the learning of both invariant and equivariant features, leading to more robust and generalizable visual representations for computer vision tasks.
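The core blending step described above can be illustrated with a minimal sketch of scaled dot-product cross-attention, where features of one augmented view serve as queries and features of the other view serve as keys and values. The function name, feature shapes, and single-head formulation here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    """Blend two views' features via single-head cross-attention (illustrative sketch).

    q_feats:  (n, d) token features of view A, used as queries.
    kv_feats: (m, d) token features of view B, used as keys and values.
    Returns:  (n, d) blended features; each row is a convex combination of
              view B's feature rows, weighted by query-key similarity.
    """
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)          # (n, m) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)        # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over view B tokens
    return weights @ kv_feats                           # attention-weighted blend
```

In the setting described by the abstract, the blended output would then be passed to a decoder that reconstructs one of the two augmented views, encouraging the encoder to retain augmentation-related (equivariant) information rather than discarding it.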