Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation, as tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale and reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable and controllable generation of informative contact interactions. Based on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with designed supervisory signals, providing tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.