Touch is the primary medium through which humans interact with the environment. Current tactile learning focuses mainly on image-level pretraining or alignment. However, a tactile signal corresponds to local contact with an object, yet research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are lacking. To bridge this gap, we first build a data collection system and use it to acquire a large-scale tactile dataset of over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design the Vis-Tac Holographic Matching Benchmark to evaluate local-to-global vision-tactile alignment. We then propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these methods outperform both methods without alignment and methods that align with whole-object images.