Self-supervised learning (SSL) has revolutionized visual representation learning, but it has not yet achieved the robustness of human vision. One reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn it or move around it, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. To this end, we extract the actions performed to change from one egocentric view of an object to another in four video datasets. We then introduce a new loss function that learns visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This allows the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. In our analysis, we find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
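The abstract only names the loss at a high level. As a minimal sketch of one plausible instantiation, the code below aligns an action embedding with a pair of view embeddings from the same clip via an InfoNCE objective. All specifics here are our assumptions, not the paper's specification: the class name `ActionAlignedSSL`, the MLP action encoder, the additive composition of view and action embeddings, and the contrastive formulation are hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAlignedSSL(nn.Module):
    """Hypothetical sketch: align an action embedding with the
    representations of two images taken from the same clip."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 action_dim: int, emb_dim: int = 128):
        super().__init__()
        self.backbone = backbone                  # visual encoder, e.g. a ResNet trunk
        self.proj = nn.Linear(feat_dim, emb_dim)  # projection head for image features
        self.action_mlp = nn.Sequential(          # action encoder (assumed MLP)
            nn.Linear(action_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor,
                action: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
        # Embed both views and the action that transforms view 1 into view 2.
        z1 = F.normalize(self.proj(self.backbone(x1)), dim=-1)
        z2 = F.normalize(self.proj(self.backbone(x2)), dim=-1)
        a = F.normalize(self.action_mlp(action), dim=-1)
        # Assumed composition: view-1 embedding plus action embedding
        # should land near the view-2 embedding; contrast over the batch.
        pred = F.normalize(z1 + a, dim=-1)
        logits = pred @ z2.t() / temperature            # (B, B) similarity matrix
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)         # InfoNCE over batch pairs
```

Under this reading, the action acts as a latent-space transformation between views, which is one way interactions could structure the visual representation as the abstract describes; the paper's actual loss may compose the embeddings differently.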