Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
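To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (i) cross-attention in which vision tokens query tactile tokens, and (ii) a bilateral force-balance regularizer on the two fingertip contact forces. All shapes, function names, and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cross_attention(vision_tokens, tactile_tokens):
    """Single-head cross-attention: vision queries attend over tactile
    keys/values (a simplified stand-in for one CMT fusion layer)."""
    d = vision_tokens.shape[-1]
    scores = vision_tokens @ tactile_tokens.T / np.sqrt(d)
    # Row-wise softmax over tactile tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tactile_tokens  # fused features, one per vision token

def symmetry_loss(f_left, f_right):
    """Penalize unequal fingertip force magnitudes, a simple proxy for
    the bilateral force balance the regularizer encourages."""
    return float(np.mean((np.linalg.norm(f_left, axis=-1)
                          - np.linalg.norm(f_right, axis=-1)) ** 2))

rng = np.random.default_rng(0)
fused = cross_attention(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
print(fused.shape)  # (4, 8)

# Equal-and-opposite fingertip forces incur zero penalty.
print(symmetry_loss(np.array([[0., 0., 1.]]), np.array([[0., 0., -1.]])))  # 0.0
```

In a training loop, a term like `symmetry_loss` would be weighted and added to the policy objective so that tactile embeddings stay consistent with physically plausible grasps.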