Robotics is evolving rapidly and needs methods that can fuse multiple sensory modalities. In particular, when a robot interacts with physical objects, combining visual and tactile data effectively is key to understanding the complex dynamics of the physical world and to responding adaptively to changing environments. However, much of the earlier work on merging these two modalities has relied on supervised methods that use human-labeled datasets. This paper introduces MViTac, a novel methodology that uses contrastive learning to integrate vision and touch in a self-supervised fashion. Using both sensory inputs, MViTac applies intra- and inter-modality losses to learn representations, which leads to improved material property classification and grasping prediction. We evaluate the method on two tasks, material classification and grasping success prediction, and show that it outperforms existing state-of-the-art self-supervised and supervised techniques. Our results indicate that MViTac yields better modality encoders and more robust representations, as evidenced by linear probing.
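As an illustration of how intra- and inter-modality contrastive losses can be combined, the following is a minimal sketch and not the MViTac implementation. It assumes an InfoNCE-style objective, a temperature of 0.07, and an equal default weighting of the two terms; none of these are stated in the abstract. The names `info_nce`, `combined_loss`, `vis_a`, `vis_b`, `tac_a`, `tac_b`, and `inter_weight` are hypothetical. The inputs stand for embeddings of two augmented views of each visual and tactile sample, where row i of every array comes from the same interaction.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE-style loss: queries[i] and keys[i] are the positive pair,
    every other key in the batch serves as a negative.
    The temperature value is an assumption, not taken from the paper."""
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))


def combined_loss(vis_a, vis_b, tac_a, tac_b, inter_weight=1.0):
    """Hypothetical combination of intra- and inter-modality terms."""
    # Intra-modality: two views of the same modality should agree.
    intra = info_nce(vis_a, vis_b) + info_nce(tac_a, tac_b)
    # Inter-modality: vision and touch from the same sample should agree,
    # computed in both directions.
    inter = info_nce(vis_a, tac_a) + info_nce(tac_a, vis_a)
    return intra + inter_weight * inter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vis = rng.normal(size=(8, 16))
    tac = rng.normal(size=(8, 16))
    print(combined_loss(vis, vis, tac, tac))
```

In a full pipeline the embeddings would come from trainable vision and tactile encoders, and the frozen encoders would later be assessed with a linear classifier, as in the linear probing mentioned above.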