WorldGym: World Model as An Environment for Policy Evaluation

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model correlate highly with real-world success rates. Moreover, we show that WorldGym preserves relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
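The evaluation loop described in the abstract (roll a policy forward inside the world model from a start frame, score the rollout, and average over many rollouts) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not WorldGym's actual interface: every name here (`rollout`, `estimate_success_rate`, `make_toy_world_model`, `toy_reward`, and the two toy policies) is hypothetical, a "frame" is reduced to a single float, and a simple threshold check stands in for the vision-language-model reward.

```python
import random
from typing import Callable, List, Sequence

# Assumption: a "frame" is a single float here; in the paper it is an image.
Frame = float
Action = float
Policy = Callable[[Frame], Action]
WorldModel = Callable[[Sequence[Frame], Action], Frame]
RewardFn = Callable[[Sequence[Frame]], bool]


def rollout(policy: Policy, world_model: WorldModel, reward_fn: RewardFn,
            start_frame: Frame, horizon: int) -> bool:
    """Autoregressively generate one trajectory and score it as success/failure."""
    frames: List[Frame] = [start_frame]
    for _ in range(horizon):
        action = policy(frames[-1])                  # policy sees the latest frame
        frames.append(world_model(frames, action))   # model predicts the next frame
    return reward_fn(frames)


def estimate_success_rate(policy: Policy, world_model: WorldModel,
                          reward_fn: RewardFn, start_frames: Sequence[Frame],
                          rollouts_per_frame: int, horizon: int) -> float:
    """Monte Carlo estimate: fraction of successful rollouts over all start frames."""
    outcomes = [
        rollout(policy, world_model, reward_fn, frame, horizon)
        for frame in start_frames
        for _ in range(rollouts_per_frame)
    ]
    return sum(outcomes) / len(outcomes)


# --- Hypothetical toy stand-ins, only so the sketch runs end to end ---
GOAL = 1.0


def make_toy_world_model(seed: int) -> WorldModel:
    """Toy stochastic dynamics: next frame = last frame + action + small noise."""
    rng = random.Random(seed)

    def world_model(history: Sequence[Frame], action: Action) -> Frame:
        return history[-1] + action + rng.gauss(0.0, 0.01)

    return world_model


def toy_reward(frames: Sequence[Frame]) -> bool:
    """Placeholder for the VLM reward: success if the final frame is near GOAL."""
    return abs(frames[-1] - GOAL) < 0.1


def reaching_policy(frame: Frame) -> Action:
    """Moves halfway toward the goal at every step."""
    return 0.5 * (GOAL - frame)


def idle_policy(frame: Frame) -> Action:
    """Never moves."""
    return 0.0


if __name__ == "__main__":
    for name, policy in [("reaching", reaching_policy), ("idle", idle_policy)]:
        rate = estimate_success_rate(policy, make_toy_world_model(seed=0), toy_reward,
                                     start_frames=[0.0, 0.2, 0.4],
                                     rollouts_per_frame=10, horizon=20)
        print(f"{name}: estimated success rate = {rate:.2f}")
```

Comparing the estimated rates across policies gives a relative ranking inside the model. Whether that ranking matches real-world performance depends on the fidelity of the world model and the reward model, which this toy setup does not capture.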

