Touch sensing is beneficial for solving a wide variety of manipulation tasks. While tactile sensors with a wide range of properties exist, fusing multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions to solve complex contact-rich manipulation tasks. We propose a novel architecture that uses modality-specific convolutional stems and transformer-based fusion to combine information from an RGB camera stream, a vision-based GelSight Mini sensor, and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80%, while the vision-only (31%) and visual-tactile (54%) baselines cannot solve the tasks reliably. Co-training a visuo-tactile model with multi-tactile data improves performance by over 10% on certain tasks, even without access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.
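
To illustrate what conditioning a flow-matching policy on a representation can look like at inference time, the following is a minimal sketch and not the MiTaS implementation. It assumes the common straight-line (linear interpolation) flow-matching formulation, in which an action is obtained by integrating a velocity field from Gaussian noise at t = 0 to an action at t = 1 with Euler steps. The names `sample_action`, `toy_velocity`, and the `cond` dictionary are hypothetical; in a real system the velocity would be predicted by a learned network that takes the fused multi-sensor embedding as its condition, whereas here a closed-form toy field stands in for it.

```python
import random


def sample_action(velocity, cond, dim, steps=10, seed=0):
    """Euler-integrate dx/dt = velocity(x, t, cond) from t=0 (noise) to t=1."""
    rng = random.Random(seed)
    # Start from a Gaussian noise sample, as is usual in flow matching.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that transports x_t to the target x_1 is (x_1 - x_t) / (1 - t).
    Here the 'condition' simply carries the target action directly.
    """
    target = cond["target"]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


if __name__ == "__main__":
    cond = {"target": [0.5, -0.2, 1.0]}  # assumed 3-D action, for illustration only
    print(sample_action(toy_velocity, cond, dim=3, steps=10))
```

With this toy field, the Euler iterates move along the straight line from the noise sample to the target and land on the target at the final step, up to floating-point error. A learned velocity network would only approximate this behavior, and the number of integration steps then trades off inference latency against accuracy.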