RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks, in both simulation and the real world, under severe visual occlusions. In real-world experiments under occlusion, explicit RGB-S grounding in the image domain improves manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
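To make the projection and rendering steps concrete, here is a minimal NumPy sketch of one way they could be realized. It is an illustration under stated assumptions, not the paper's implementation: the function names `project_points` and `render_saliency`, the pinhole model without lens distortion, the fixed `sigma`, the force normalization constant `f_max`, the choice to let force scale the Gaussian amplitude, and the max-combination of overlapping blobs are all hypothetical. Sensor positions are assumed to be already expressed in the world frame, for example as the output of forward kinematics.

```python
import numpy as np

def project_points(p_world, T_cam_world, K):
    """Project N x 3 world points to pixels with a pinhole camera model.

    T_cam_world: 4 x 4 extrinsic matrix (world -> camera), from calibration.
    K:           3 x 3 intrinsic matrix.
    Returns (uv, depth): N x 2 pixel coordinates and N camera-frame depths.
    """
    p_world = np.asarray(p_world, dtype=float)
    p_h = np.hstack([p_world, np.ones((len(p_world), 1))])  # homogeneous, N x 4
    p_cam = (T_cam_world @ p_h.T).T[:, :3]                  # camera frame, N x 3
    uvw = (K @ p_cam.T).T                                   # N x 3
    uv = uvw[:, :2] / uvw[:, 2:3]                           # perspective divide
    return uv, p_cam[:, 2]

def render_saliency(uv, depth, forces, hw, sigma=8.0, f_max=10.0):
    """Render one H x W saliency map from projected contacts.

    Assumption: each contact becomes an isotropic Gaussian whose amplitude is
    the contact force clipped to [0, f_max] and normalized to [0, 1]; sigma
    (in pixels) stands in for kinematic and calibration uncertainty.
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    saliency = np.zeros((h, w))
    for (u, v), z, f in zip(uv, depth, forces):
        if z <= 0:  # contact is behind the camera, so it cannot be drawn
            continue
        amp = np.clip(f / f_max, 0.0, 1.0)
        blob = amp * np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
        saliency = np.maximum(saliency, blob)  # overlapping contacts: keep max
    return saliency
```

The resulting map has the same spatial layout as the RGB frame, so it could be passed to a visual backbone as an additional image-aligned conditioning input.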
