Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, collecting robot data can be prohibitively expensive and time-consuming. The problem is particularly acute in dexterous manipulation, where teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation, used for both observations and actions, can bridge human and robot embodiments. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos and train an autoregressive transformer over these keypoints. We observe that human and robot behaviors align closely at the keypoint level, specifically at the wrist and fingertips, which enables direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art vision-language-action (VLA) baseline reaches only 1.0%. Furthermore, our method generalizes to unseen scenarios, including multi-object environments and novel object categories.