Existing robotic manipulation methods rely primarily on visual and proprioceptive observations, and may therefore struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in the multimodal fusion literature. Most multimodal fusion approaches implicitly assume that all modalities play homogeneous roles, and thus adopt flat, symmetric fusion structures. This assumption is ill-suited to acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. A diffusion-based policy uses the fused representation to generate continuous robot actions directly from multimodal observations. Combining end-to-end learning with a hierarchical fusion structure enables the policy to exploit task-relevant acoustic information while mitigating interference from less informative modalities. We evaluate the proposed method on real-world robotic manipulation tasks, including liquid pouring and cabinet opening. Extensive experimental results demonstrate that our approach consistently outperforms state-of-the-art multimodal fusion frameworks, particularly in scenarios where acoustic cues provide task-relevant information that is not readily available from visual observations alone. Furthermore, we conduct a mutual information analysis to interpret the effect of audio cues on robotic manipulation under multimodal fusion.
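The two-stage fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the layers involved, so everything below is an assumption made for illustration: stage 1 is approximated by a FiLM-style scale and shift predicted from the audio features, and stage 2 by an element-wise product of projected visual and proprioceptive features as a simple second-order interaction term. All function names, dimensions, and the random weights (standing in for learned parameters) are hypothetical, and the diffusion-based policy that would consume the fused vector is not shown.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Toy linear layer: random weights and zero bias stand in for learned parameters."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
    return weights, [0.0] * out_dim


def apply_linear(x, layer):
    """Compute x @ W + b for a single feature vector given as a list of floats."""
    weights, bias = layer
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def condition_on_audio(features, audio, scale_layer, shift_layer):
    """Stage 1 (assumed FiLM-style): audio predicts a per-channel scale and shift."""
    scale = apply_linear(audio, scale_layer)
    shift = apply_linear(audio, shift_layer)
    return [(1.0 + s) * f + b for f, s, b in zip(features, scale, shift)]


def init_params(audio_dim, vision_dim, proprio_dim, fused_dim, seed=0):
    """Build all toy layers from one seeded generator so the sketch is reproducible."""
    rng = random.Random(seed)
    return {
        "v_scale": make_linear(audio_dim, vision_dim, rng),
        "v_shift": make_linear(audio_dim, vision_dim, rng),
        "p_scale": make_linear(audio_dim, proprio_dim, rng),
        "p_shift": make_linear(audio_dim, proprio_dim, rng),
        "v_proj": make_linear(vision_dim, fused_dim, rng),
        "p_proj": make_linear(proprio_dim, fused_dim, rng),
    }


def fuse(audio, vision, proprio, params):
    """Hierarchical fusion sketch: audio conditioning first, then cross-modal interaction."""
    # Stage 1: modulate vision and proprioception with the audio features.
    v = condition_on_audio(vision, audio, params["v_scale"], params["v_shift"])
    p = condition_on_audio(proprio, audio, params["p_scale"], params["p_shift"])
    # Stage 2 (assumed): project both to a common width and add their
    # element-wise product as a simple second-order interaction term.
    v_proj = apply_linear(v, params["v_proj"])
    p_proj = apply_linear(p, params["p_proj"])
    interaction = [a * b for a, b in zip(v_proj, p_proj)]
    # Concatenate the two projections and the interaction term.
    return v_proj + p_proj + interaction


params = init_params(audio_dim=4, vision_dim=8, proprio_dim=6, fused_dim=5)
audio = [0.2, -0.1, 0.0, 0.5]
vision = [0.1 * i for i in range(8)]
proprio = [0.05 * i for i in range(6)]
fused = fuse(audio, vision, proprio, params)
print(len(fused))  # 15: two projections plus one interaction term, each of width 5
```

With zero biases, an all-zero audio vector leaves the visual and proprioceptive features unchanged in stage 1. This is one simple way to reflect the sparse, contact-driven nature of acoustic signals, since the conditioning only acts when audio is present. In a full system the fused vector would serve as the conditioning input to the action-generating policy.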