Self-supervised learning (SSL) has revolutionized visual representation learning, but it has not yet achieved the robustness of human vision. One possible reason is that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn it or move around it, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. To this end, we extract the actions performed to change from one egocentric view of an object to another in four video datasets. We then introduce a new loss function that learns visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This allows the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. Our analysis indicates that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
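The idea of aligning a performed action with the representations of two images from the same clip can be illustrated with a minimal sketch. The abstract does not specify the form of the loss, so everything below is an assumption made for illustration: the name `action_alignment_loss`, the use of the embedding difference `z_b - z_a` as the representation of the image pair, the InfoNCE-style contrastive objective, and the `temperature` value are all hypothetical and should not be read as the method proposed in this work.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def action_alignment_loss(z_a, z_b, act, temperature=0.1):
    """Hypothetical InfoNCE-style loss between image pairs and actions.

    z_a, z_b : (N, D) embeddings of two views taken from the same clip.
    act      : (N, D) embeddings of the action that leads from view a to b.
    Assumption: the pair is summarized by the difference z_b - z_a.
    """
    pair = l2_normalize(z_b - z_a)
    act = l2_normalize(act)
    # Cosine similarity of every pair with every action in the batch.
    logits = pair @ act.T / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching action for pair i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
z_b = rng.normal(size=(8, 16))
matched = z_b - z_a                      # actions consistent with the pairs
mismatched = np.roll(matched, 1, axis=0)  # each action assigned to a wrong pair

print(action_alignment_loss(z_a, z_b, matched))
print(action_alignment_loss(z_a, z_b, mismatched))
```

In this toy setting, the loss is lower when each action embedding matches its own image pair than when the actions are shifted to the wrong pairs, which is the sense in which such an objective would let actions structure the visual latent space.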