The ability to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As a step toward this goal, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture is the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing the transfer of interaction knowledge from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications of generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.