Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.