Evaluating Gemini Robotics Policies in a Veo World Simulator

Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
