Recently, video-based world models that learn to simulate environment dynamics have gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially in contact-rich manipulation tasks, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework that employs reinforcement learning to align video-based embodied world models with physical realism, task-completion capability, embodiment plausibility, and visual quality. Specifically, we first construct a large-scale (~235K) video preference dataset and use it to train a hierarchical reward model designed to capture multi-dimensional rewards consistent with human preferences. We further propose a practical alignment algorithm that post-trains flow-based world models on this reward via a computationally efficient PPO-style algorithm. Comprehensive experiments and theoretical analysis demonstrate that ReWorld significantly improves the physical fidelity, logical coherence, embodiment plausibility, and visual quality of generated rollouts, outperforming previous methods.
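To make the PPO-style post-training concrete, below is a minimal sketch of the clipped surrogate objective that such algorithms typically optimize, applied per sample to a reward-model-derived advantage. The function name, signature, and the reward-to-advantage setup are illustrative assumptions, not the paper's actual implementation.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate loss for one sample (hypothetical helper).

    logp_new / logp_old: log-probabilities of the sampled rollout under the
    current and pre-update (reference) policy; advantage: derived from the
    reward model's score, e.g. reward minus a baseline (assumption).
    """
    ratio = math.exp(logp_new - logp_old)           # importance ratio
    unclipped = ratio * advantage
    # Clip the ratio to [1 - eps, 1 + eps] to keep updates conservative
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # PPO maximizes the minimum of the two surrogates; negate for a loss
    return -min(unclipped, clipped)

# Toy usage: identical policies give loss == -advantage
loss = ppo_clip_loss(logp_new=0.0, logp_old=0.0, advantage=1.0)
```

With equal log-probabilities the ratio is 1 and the loss reduces to the negative advantage; when the new policy drifts far from the old one, the clip term caps how much a single high-reward sample can move the model, which is what makes this style of update computationally and statistically stable for post-training large flow-based generators.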