A holistic understanding of object properties across diverse sensory modalities (e.g., visual, auditory, and haptic) is essential for tasks ranging from object categorization to complex manipulation. Drawing inspiration from cognitive science studies that emphasize the importance of multi-sensory integration in human perception, we introduce MOSAIC (Multimodal Object property learning with Self-Attention and Interactive Comprehension), a novel framework for learning unified multi-sensory object property representations. While visual information undeniably plays a prominent role, many fundamental object properties extend beyond the visual domain to attributes such as texture, mass distribution, or sound, which significantly influence how we interact with objects. MOSAIC leverages this insight by distilling knowledge from multimodal foundation models and aligning the resulting representations not only across vision but also across the haptic and auditory sensory modalities. Through extensive experiments on a dataset in which a humanoid robot interacts with 100 objects across 10 exploratory behaviors, we demonstrate the versatility of MOSAIC on two task families: object categorization and object fetching. Our results underscore the efficacy of MOSAIC's unified representations, which achieve competitive performance in category recognition with a simple linear probe and excel in the object-fetching task under zero-shot transfer conditions. This work pioneers the application of sensory grounding in foundation models for robotics, promising a significant leap in multi-sensory perception capabilities for autonomous systems. We have released the code, datasets, and additional results at https://github.com/gtatiya/MOSAIC.
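The core idea of aligning non-visual sensory embeddings to a frozen foundation-model embedding space can be sketched with a symmetric InfoNCE-style objective. The sketch below is a minimal illustration, not MOSAIC's actual implementation: it assumes a batch of student embeddings (e.g., from a haptic or audio encoder) paired one-to-one with frozen teacher embeddings of the same object interactions, and the `temperature` value is a common default rather than a value from the paper.

```python
import numpy as np

def infonce_alignment_loss(student, teacher, temperature=0.07):
    """Symmetric contrastive loss pulling each student embedding (e.g., haptic)
    toward the frozen teacher (foundation-model) embedding of the same object,
    while pushing it away from embeddings of other objects in the batch.

    student, teacher: arrays of shape (batch, dim), row i of each is a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (batch, batch) similarity matrix
    diag = np.arange(len(s))        # matched pairs sit on the diagonal

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # Average the student->teacher and teacher->student directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In training, the teacher side would stay frozen and gradients would flow only into the student encoder, so the shared embedding space is anchored by the foundation model.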