Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, prior robotics work has adapted video models for policy learning, but these approaches introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score on challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates on challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/.
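The test-time planning described above can be illustrated with a minimal best-of-N sketch: sample several candidate latent frames (which, per the abstract, jointly encode actions, future states, and values), decode each candidate's predicted value, and execute the candidate with the highest expected cumulative reward. All function names, dimensions, and the random-latent sampler below are hypothetical stand-ins, not the actual Cosmos Policy implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_frames(num_candidates, latent_dim):
    """Stand-in for the video model's latent diffusion sampler.

    In Cosmos Policy, actions, future state images, and values are all
    generated as latent frames of one diffusion process; here we simply
    draw Gaussian latents as a placeholder."""
    return rng.normal(size=(num_candidates, latent_dim))

def decode_value(latent):
    """Hypothetical value decoder: maps a sampled 'value frame' latent
    to a scalar expected cumulative reward."""
    return float(latent.sum())

def decode_actions(latent, action_dim=7):
    """Hypothetical action decoder: extracts an action vector
    (e.g., a 7-DoF arm command) from the candidate latent."""
    return latent[:action_dim]

def plan(num_candidates=8, latent_dim=32):
    """Best-of-N planning: sample candidate trajectories, score each by
    its predicted value, and return the highest-value candidate's actions."""
    candidates = sample_latent_frames(num_candidates, latent_dim)
    values = [decode_value(c) for c in candidates]
    best = int(np.argmax(values))
    return decode_actions(candidates[best]), values[best]

actions, value = plan()
print(actions.shape, value)
```

The key design point this sketch mirrors is that value prediction reuses the same generative process as action generation, so planning requires only extra sampling and decoding, not a separate critic network.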