Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation, especially in contact-rich tasks, remains challenging. This is largely due to the lack of datasets that combine rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision-based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic learning, and employs automated data collection pipelines covering real-world, demand-driven scenarios to ensure scalability. To validate the effectiveness of the dataset, we conduct extensive quantitative experiments on cross-modal retrieval as well as real-robot evaluation. Finally, we demonstrate real-world performance through generalizable inference across multiple robots, policies, and tasks.