In Model Predictive Control (MPC), a world model predicts the future outcomes of candidate actions, and these predictions are scored to select the best action. In visuomotor MPC, the score function is a distance between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, obtaining the goal image before task execution is difficult, particularly in new environments. Moreover, specifying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. Each proposed action is then scored by how close its predicted outcome is to the task instruction, measured by the similarity of their embeddings. This approach turns visuomotor MPC into a vision-language-action model (VLA) that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on a test set of 288 tasks that feature unseen visual signals and referring expressions yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set, reaching a 90% success rate there.
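The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_world_model` is a hypothetical stand-in for a learned latent dynamics model, the instruction embedding is a hand-picked vector rather than the output of a real text encoder, and random shooting is assumed as the proposal sampler. The only idea taken from the text is that each action proposal is ranked by the similarity between its predicted future latent and the instruction embedding.

```python
import numpy as np


def cosine_similarity(latents, goal):
    """Cosine similarity between each row of `latents` and the vector `goal`."""
    latents = latents / np.linalg.norm(latents, axis=-1, keepdims=True)
    goal = goal / np.linalg.norm(goal)
    return latents @ goal


def toy_world_model(z, action):
    """Hypothetical latent dynamics (assumption): next latent = latent + action."""
    return z + action


def plan(z0, instruction_embedding, num_proposals=256, horizon=5, seed=0):
    """Random-shooting MPC: sample action sequences, roll them out in latent
    space, and score each final latent against the instruction embedding."""
    rng = np.random.default_rng(seed)
    dim = z0.shape[0]
    # Candidate action sequences with shape (num_proposals, horizon, dim).
    proposals = rng.normal(scale=0.5, size=(num_proposals, horizon, dim))
    z = np.tile(z0, (num_proposals, 1))
    for t in range(horizon):
        z = toy_world_model(z, proposals[:, t])
    scores = cosine_similarity(z, instruction_embedding)
    best = int(np.argmax(scores))
    # As in receding-horizon MPC, only the first action of the best sequence is executed.
    return proposals[best, 0], float(scores[best]), scores


if __name__ == "__main__":
    z0 = np.zeros(4)
    instruction = np.array([1.0, 0.0, 0.0, 0.0])  # placeholder for a text embedding
    action, best_score, _ = plan(z0, instruction)
    print("first action:", action, "score:", best_score)
```

In a real system the placeholder embedding would come from the language side of the vision-language-aligned encoder, and the predicted latents from the world model trained in that same space, so that the two are directly comparable.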