Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations, such as object penetration and anti-gravity motion, because they are trained on generic visual data with likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B-parameter Diffusion Transformer that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotations, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark, which combines real and synthetic data covering unseen robot-task-scene combinations. It employs a decoupled protocol that assesses physical realism and action alignment separately. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.