The ability to associate touch with other modalities has major implications for both humans and computational systems. However, multimodal learning with touch remains challenging because data collection is expensive and sensor outputs are not standardized. We introduce UniTouch, a unified tactile model for vision-based touch sensors that connects touch to multiple modalities, including vision, language, and sound. We achieve this by aligning UniTouch embeddings to pretrained image embeddings that are already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, which allow the model to learn from a set of heterogeneous tactile sensors simultaneously. UniTouch can perform a range of touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch-image question answering. To the best of our knowledge, UniTouch is the first model to demonstrate such capabilities. Project page: https://cfeng16.github.io/UniTouch/
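The alignment idea described in the abstract can be illustrated with a small sketch. It assumes a symmetric InfoNCE-style contrastive loss between touch embeddings and frozen image embeddings; the abstract itself does not name the objective, so that choice, the temperature value, and every function name here (`toy_touch_encoder`, `info_nce`, `touch_image_alignment_loss`) are hypothetical and for illustration only. The toy encoder simply prepends a per-sensor token to the patch vectors and mean-pools, standing in for a real tactile network.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_touch_encoder(patches, sensor_token):
    """Hypothetical stand-in for a tactile encoder.

    Prepends the learnable sensor-specific token to the patch vectors
    and mean-pools the resulting sequence into one embedding.
    """
    sequence = [sensor_token] + patches
    dim = len(sensor_token)
    return [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def touch_image_alignment_loss(touch_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss (assumed objective, not taken from the source).

    image_emb is treated as fixed output of a pretrained image encoder;
    only the touch side would receive gradients in a real training loop.
    """
    t = [l2_normalize(v) for v in touch_emb]
    im = [l2_normalize(v) for v in image_emb]
    return 0.5 * (info_nce(t, im, temperature) + info_nce(im, t, temperature))
```

In this sketch, paired touch and image embeddings that point in the same direction give a loss near zero, while mismatched pairs give a large loss. Because the image embeddings are assumed to already share a space with language and sound, a touch embedding aligned this way could be compared against those modalities without additional paired data.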