Autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. Because teleoperation data is costly to collect, extracting trajectories that are both visually faithful and physically plausible from monocular videos is a promising, scalable supplement and an active frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework that learns dexterous manipulation policies directly from human demonstration videos. It integrates 3D asset acquisition, trajectory estimation, and dexterous policy learning into a single pipeline. To bridge the gap between visual perception and physical constraints, a two-stage refinement process enforces spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks show that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Experiments further show an average success rate above 75% across multiple synthetic manipulation tasks and validate that the extracted manipulation priors adapt across diverse dexterous hand embodiments.