OmniVTLA: Vision-Tactile-Language-Action Models with Semantic-Aligned Tactile Sensing

Recent vision-language-action (VLA) models build on vision-language foundation models and have achieved promising results, showing potential for task generalization in robot manipulation. However, because tactile sensors are heterogeneous and tactile data is difficult to acquire, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories. With 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments on pick-and-place tasks demonstrate substantial improvements over state-of-the-art VLA baselines: OmniVTLA achieves a 96.9% success rate with grippers (21.9% higher than the baseline) and a 100% success rate with dexterous hands (6.2% higher than the baseline). In addition, compared with existing VLA models, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing. The ObjTac dataset is available at https://readerek.github.io/Objtac.github.io
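
The abstract does not state which objective is used to train the semantically aligned tactile encoder. As a minimal sketch of one common way to align two modalities, the snippet below computes a CLIP-style symmetric contrastive loss between tactile and text embeddings. This choice of loss is an assumption, not the paper's confirmed method, and the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(tactile, text, temperature=0.07):
    """Assumed alignment objective: matched (tactile, text) pairs share an index.

    `tactile` and `text` are equal-length lists of embedding vectors.
    The temperature of 0.07 is a hypothetical default, not a value from the paper.
    """
    tac = [l2_normalize(v) for v in tactile]
    txt = [l2_normalize(v) for v in text]
    # Cosine-similarity logits: rows index tactile samples, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in txt]
        for t in tac
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-tactile direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy embeddings: two samples, two dimensions (illustrative values only).
tactile_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(tactile_emb, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss(tactile_emb, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the correctly paired embeddings give a lower loss than the shuffled pairing, which is the behavior an alignment objective of this kind is meant to reward.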
