Touch supports and enhances the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research on touch has focused primarily on the visual and tactile modalities, with limited exploration of language. Beyond individual vocabulary words, sentence-level descriptions carry richer semantics. Motivated by this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. We use the new dataset to fine-tune TLV-Link (Linking Touch, Language, and Vision through Alignment), our proposed lightweight training framework, which achieves effective semantic alignment while adjusting only a minimal share of parameters (1%). Project Page: https://xiaoen0.github.io/touch.page/.