Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.
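As a rough illustration of the masking step that masked autoencoding relies on, the sketch below pools visual and tactile tokens into one sequence and hides a random subset of them. It is a minimal sketch under stated assumptions, not the M3L implementation: the function name `mask_multimodal_tokens`, the 0.75 mask ratio, the token counts, and the choice to mask both modalities with a single shared random draw are all hypothetical. The encoder, the decoder that reconstructs the hidden tokens, and the reinforcement learning policy are omitted.

```python
import random


def mask_multimodal_tokens(vision_tokens, tactile_tokens, mask_ratio, seed=0):
    """Pool tokens from both modalities and hide a random subset of them.

    Returns the visible (modality, token) pairs, their indices in the pooled
    sequence, and the indices of the hidden tokens.
    """
    # Tag each token with its modality so the origin is still known after pooling.
    tokens = [("vision", t) for t in vision_tokens]
    tokens += [("touch", t) for t in tactile_tokens]

    # Number of tokens left visible (assumption: always keep at least one).
    num_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))

    # Sample the visible positions without replacement, using a seeded generator.
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    keep_set = set(keep_idx)

    visible = [tokens[i] for i in keep_idx]
    masked_idx = [i for i in range(len(tokens)) if i not in keep_set]
    return visible, keep_idx, masked_idx


# Hypothetical sizes: 16 image-patch tokens and 4 tactile tokens.
visible, keep_idx, masked_idx = mask_multimodal_tokens(
    vision_tokens=list(range(16)),
    tactile_tokens=list(range(4)),
    mask_ratio=0.75,
)
print(len(visible), len(masked_idx))
```

In a masked autoencoder, only the visible tokens would be passed to the encoder, and a decoder would be trained to reconstruct the hidden ones. Because the tokens are pooled before masking, a hidden patch from one modality can in principle be inferred from visible patches of the other.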