Although pre-training on large amounts of data benefits robot learning, current paradigms perform large-scale pre-training only for visual representations; representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data could be used to pre-train other modalities such as tactile sensing. Such pre-training becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pre-training to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, ours is the first approach to leverage large-scale multisensory pre-training for robotic manipulation. For supplementary material, including videos of real-robot experiments, please see https://sites.google.com/view/hearing-touch.