Learning generalist policies capable of accomplishing a wide range of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation offers a feasible alternative, designing task-specific environments and reward functions for many tasks is similarly challenging. We propose Dex4D, a framework that instead leverages simulation to learn task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic, 3D point-track-conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, the policy transfers zero-shot to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment across diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.