Self-supervised representation learning (SSRL) methods have shown great success in computer vision. Recent augmentation-based contrastive learning methods learn representations that are invariant or equivariant to pre-defined data augmentation operations. However, invariant or equivariant features favor only specific downstream tasks, depending on the augmentations chosen, and performance can be poor when the learned representation does not match the task requirements. Here, we consider an active observer that can manipulate views of an object and knows the action(s) that generated each view. We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER). CIPER combines invariant and equivariant learning objectives using one shared encoder and two different output heads on top of it. One output head is a projection head with a state-of-the-art contrastive objective that encourages invariance to augmentations. The other is a prediction head that estimates the augmentation parameters, capturing equivariant features. Both heads are discarded after training, and only the encoder is used for downstream tasks. We evaluate our method on static image tasks and time-augmented image datasets. Our results show that CIPER outperforms a baseline contrastive method on various tasks. Interestingly, CIPER encourages the formation of hierarchically structured representations in which different views of an object become systematically organized in the latent representation space.
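The two-head design described above can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. The abstract does not name the contrastive objective, so an InfoNCE-style loss is used as a stand-in. Several other details are also assumptions made for illustration: the linear layers, the concatenation of the two view encodings as input to the prediction head, the mean-squared-error regression target, and the `weight` parameter. All function names are hypothetical.

```python
import math
import random

def linear(x, w):
    """Apply weight matrix w (a list of rows) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def normalize(v):
    """Scale v to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def init(rows, cols, rng):
    """Random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style loss (assumed stand-in): z1[i] should match z2[i]
    against every other z2[j] in the batch."""
    loss = 0.0
    for i, zi in enumerate(z1):
        logits = [dot(zi, zj) / temperature for zj in z2]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(z1)

def two_head_loss(views1, views2, actions, params, weight=1.0):
    """Combine an invariance loss and an action-prediction loss
    computed from one shared encoder."""
    enc, proj, pred = params
    # Shared encoder: the only part kept for downstream tasks.
    h1 = [relu(linear(x, enc)) for x in views1]
    h2 = [relu(linear(x, enc)) for x in views2]
    # Projection head: pulls the two views of each object together.
    z1 = [normalize(linear(h, proj)) for h in h1]
    z2 = [normalize(linear(h, proj)) for h in h2]
    contrastive = info_nce(z1, z2)
    # Prediction head: estimates the known action from both encodings
    # (feeding it the concatenated pair is an assumption).
    predictive = 0.0
    for a, b, act in zip(h1, h2, actions):
        est = linear(a + b, pred)
        predictive += sum((e - t) ** 2 for e, t in zip(est, act)) / len(act)
    predictive /= len(actions)
    return contrastive + weight * predictive, contrastive, predictive

# Toy data: each "view" is a 4-d vector, each action is a scalar shift.
rng = random.Random(0)
params = (init(8, 4, rng), init(3, 8, rng), init(1, 16, rng))
views1 = [[rng.random() for _ in range(4)] for _ in range(5)]
actions = [[rng.uniform(-1.0, 1.0)] for _ in range(5)]
views2 = [[x + a[0] for x in v] for v, a in zip(views1, actions)]
total, contrastive, predictive = two_head_loss(views1, views2, actions, params)
```

In a training loop, `total` would be minimized with respect to all three weight matrices. Afterwards `proj` and `pred` would be dropped and only `enc` reused, matching the abstract's statement that both heads are discarded.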