Precision assembly requires sub-millimeter corrections in the contact-rich "last-millimeter" region, where visual feedback fails because the end-effector and workpiece occlude the scene. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three complementary mechanisms: (i) bidirectional cross-attention, which lets visual and tactile features enhance each other before fusion; (ii) a proprioception-conditioned gating network that dynamically increases reliance on tactile input when visual occlusion occurs; and (iii) a tactile reconstruction objective that forces the policy to learn manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baselines, and maintains 80% success at an industrial-grade clearance of 0.1 mm. Ablation studies confirm that each architectural component is necessary. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering multiple clearance levels will be released to support reproducible research.
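As an illustration of mechanisms (ii) and (iii), the following is a minimal sketch of proprioception-conditioned gating and a combined training loss. It is not the ReTac-ACT implementation: the function names, the single proprioceptive feature (height above the hole, in mm), the hand-set gate weights, and the loss weight `lam` are all assumptions made for this example. In the paper the gate is a learned network, and the bidirectional cross-attention of mechanism (i) is omitted here.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_weight(proprio, weights, bias):
    """Scalar tactile gate in (0, 1) computed from the proprioceptive state.

    Hypothetical linear gate: a learned network would replace this.
    """
    return sigmoid(sum(w * p for w, p in zip(weights, proprio)) + bias)


def fuse(visual, tactile, g):
    """Convex combination: g weights tactile features, (1 - g) visual ones."""
    return [g * t + (1.0 - g) * v for v, t in zip(visual, tactile)]


def total_loss(action_loss, recon_loss, lam=0.1):
    """Imitation loss plus a weighted tactile reconstruction term (lam assumed)."""
    return action_loss + lam * recon_loss


# Assumed gate parameters: a negative weight on height means the gate
# opens toward tactile input as the peg approaches the hole.
WEIGHTS, BIAS = [-4.0], 2.0

g_far = gate_weight([5.0], WEIGHTS, BIAS)   # 5 mm above the hole
g_near = gate_weight([0.0], WEIGHTS, BIAS)  # in contact

visual_feat, tactile_feat = [1.0, 2.0], [3.0, 4.0]
fused_near = fuse(visual_feat, tactile_feat, g_near)
```

With these assumed parameters the gate is close to zero far from the hole, so the fused features follow the visual stream, and rises toward one at contact, where occlusion makes vision unreliable.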