Humans make extensive use of vision and touch as complementary senses: vision provides global information about the scene, while touch measures local information during manipulation and does not suffer from occlusions. While prior work demonstrates the efficacy of tactile sensing for precise manipulation of deformables, it typically relies on supervised, human-labeled datasets. We propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for learning multi-task visuo-tactile representations in a self-supervised manner through cross-modal supervision. We design a mechanism that enables a robot to autonomously collect precisely spatially-aligned visual and tactile image pairs, then train visual and tactile encoders to embed these pairs into a shared latent space using a cross-modal contrastive loss. We apply this latent space to downstream perception and control of deformable garments on flat surfaces, and evaluate the flexibility of the learned representations without fine-tuning on 5 tasks: feature classification, contact localization, anomaly detection, feature search from a visual query (e.g., garment feature localization under occlusion), and edge following along cloth edges. The pretrained representations achieve success rates of 73-100% across these 5 tasks.
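To make the training objective concrete, the following is a minimal sketch of one common form of cross-modal contrastive loss: a symmetric InfoNCE objective over a batch of paired visual and tactile embeddings. The abstract does not specify the exact loss, so the symmetric form, the cosine similarity, the temperature value of 0.07, and all function names here are illustrative assumptions rather than the SSVTP implementation. Plain Python lists stand in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def cross_modal_contrastive_loss(visual, tactile, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed form, not confirmed by the source).

    visual[i] and tactile[i] are embeddings of the same spatially aligned
    pair; every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in tactile]
    n = len(v)
    # sim[i][j]: temperature-scaled cosine similarity of visual i, tactile j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-tactile: each row should pick out its own column.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Tactile-to-visual: each column should pick out its own row.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy batch: aligned pairs share a direction, mismatched pairs do not.
visual_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_tactile = [[2.0, 0.0], [0.0, 3.0]]
mismatched_tactile = [[0.0, 3.0], [2.0, 0.0]]

aligned_loss = cross_modal_contrastive_loss(visual_batch, aligned_tactile)
mismatched_loss = cross_modal_contrastive_loss(visual_batch, mismatched_tactile)
```

In this toy batch the loss is close to zero when each visual embedding points in the same direction as its paired tactile embedding, and large when the pairs are swapped. That is the property a shared latent space needs for tasks such as feature search from a visual query.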