In this work, we introduce general-purpose touch representations for the increasingly accessible class of vision-based tactile sensors. Such sensors have enabled many recent advances in robot manipulation, as they markedly complement vision, yet today's solutions often rely on task- and sensor-specific handcrafted perception models. Collecting real data at scale with task-centric ground-truth labels, such as contact forces and slip, is a challenge further compounded by sensors of various form factors that differ in aspects like lighting and gel markings. To tackle this, we turn to self-supervised learning (SSL), which has demonstrated remarkable performance in computer vision. We present Sparsh, a family of SSL models that can support various vision-based tactile sensors, alleviating the need for custom labels through pre-training on 460k+ tactile images with masking and self-distillation in pixel and latent spaces. We also build TacBench to facilitate standardized benchmarking across sensors and models; it comprises six tasks ranging from comprehending tactile properties to enabling physical perception and manipulation planning. In evaluations, we find that SSL pre-training for touch representations outperforms task- and sensor-specific end-to-end training by 95.1% on average over TacBench, and that Sparsh (DINO) and Sparsh (IJEPA) are the most competitive, indicating the merits of learning in latent space for tactile images. Project page: https://sparsh-ssl.github.io/