Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require substantial manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model correlate highly with real-world success rates. Moreover, we show that WorldGym preserves relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization to novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can be distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
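To make the evaluation procedure concrete, the sketch below illustrates the Monte Carlo rollout loop described above: the policy acts on frames generated by the action-conditioned world model, and a vision-language model judges task success from the resulting video. This is a minimal illustration, not the paper's implementation; the interfaces `policy`, `world_model.step`, and `vlm_reward`, as well as the rollout count and horizon, are assumptions for exposition.

```python
import numpy as np

def evaluate_policy(policy, world_model, vlm_reward, initial_frame, task_prompt,
                    num_rollouts=20, horizon=100):
    """Estimate a policy's success rate via Monte Carlo rollouts in a world model.

    Hypothetical interfaces (not from the paper):
      - policy(frame, task_prompt) -> action
      - world_model.step(frame, action) -> next predicted frame (autoregressive)
      - vlm_reward(frames, task_prompt) -> 1.0 if the VLM judges success, else 0.0
    """
    successes = []
    for _ in range(num_rollouts):
        frame, frames = initial_frame, [initial_frame]
        for _ in range(horizon):
            action = policy(frame, task_prompt)       # policy acts on generated frames
            frame = world_model.step(frame, action)   # action-conditioned video prediction
            frames.append(frame)
        successes.append(vlm_reward(frames, task_prompt))  # VLM provides the reward
    return float(np.mean(successes))                       # estimated success rate
```

Because only the initial frame comes from the real robot, swapping in a different start frame or task prompt is enough to probe generalization to new scenes, which is the property the abstract highlights.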