Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture tactile information during physical interaction. To address this gap, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich dexterous manipulation. The system enables high-fidelity acquisition of first-person vision, high-density tactile sensing, end-effector poses, and hand kinematics in unstructured, in-the-wild environments. Building upon this hardware, we propose a kinematics-grounded tactile representation learning algorithm that resolves semantic ambiguities in tactile signals. Leveraging the efficiency of DexViTac, we construct a multimodal dataset comprising over 2,400 visuo-tactile-kinematic demonstrations. Experiments demonstrate that DexViTac achieves a collection efficiency exceeding 248 demonstrations per hour and remains robust against complex visual occlusions. Real-world deployment confirms that policies trained with the proposed dataset and learning strategy achieve an average success rate exceeding 85% across four challenging tasks, significantly outperforming baseline methods and validating the improvement the system provides for learning contact-rich dexterous manipulation. Project page: https://xitong-c.github.io/DexViTac/