Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the tactile policy (66.3%) and vision-only policy (55.4%) baselines. The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
