EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
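To make the reward described above concrete, the following is a minimal sketch of how an action sequence decoded by an IDM might be scored. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function name `executability_reward`, the use of mean squared finite differences for velocity, acceleration, and jerk, the per-dimension box limits `lower` and `upper` standing in for embodiment constraints, and the weights. The IDM itself is not modeled; `actions` is assumed to be its output for one generated video.

```python
from typing import List, Sequence

Trajectory = Sequence[Sequence[float]]  # T timesteps x D action dimensions


def finite_diff(seq: Trajectory, dt: float) -> List[List[float]]:
    """First-order finite difference along time (one fewer step than the input)."""
    return [[(b - a) / dt for a, b in zip(prev, nxt)]
            for prev, nxt in zip(seq, seq[1:])]


def mean_sq(seq: Trajectory) -> float:
    """Mean of squared entries; 0.0 if the sequence is empty."""
    vals = [x * x for row in seq for x in row]
    return sum(vals) / len(vals) if vals else 0.0


def executability_reward(
    actions: Trajectory,
    lower: Sequence[float],
    upper: Sequence[float],
    dt: float = 1.0,
    w_vel: float = 1.0,    # hypothetical weights, not taken from the paper
    w_acc: float = 1.0,
    w_jerk: float = 1.0,
    w_bound: float = 1.0,
) -> float:
    """Higher is better: penalizes non-smooth motion and out-of-bound actions."""
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    smooth_cost = (w_vel * mean_sq(vel)
                   + w_acc * mean_sq(acc)
                   + w_jerk * mean_sq(jerk))

    # Assumed embodiment constraint: each action dimension must stay in [lower, upper].
    violations = [max(lo - a, 0.0) + max(a - hi, 0.0)
                  for step in actions
                  for a, lo, hi in zip(step, lower, upper)]
    bound_cost = sum(violations) / len(violations) if violations else 0.0

    return -(smooth_cost + w_bound * bound_cost)


if __name__ == "__main__":
    smooth = [[0.0], [0.1], [0.2], [0.3], [0.4]]
    jittery = [[0.0], [0.4], [-0.1], [0.5], [0.0]]
    limits = ([-1.0], [1.0])
    print(executability_reward(smooth, *limits))
    print(executability_reward(jittery, *limits))
```

Under these assumptions, a jittery trajectory of the kind a visually corrupted rollout might induce scores lower than a smooth one, which is the property the abstract relies on when it says the reward stays informative despite visual artifacts.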
