Embodied world models have emerged as a promising paradigm in robotics; most leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To address these challenges, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world-model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses learning on dynamic, task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality over prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io.
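The flow-based motion loss can be illustrated with a minimal sketch. The abstract does not specify its exact form, so the weighting scheme below is an assumption for illustration: per-pixel reconstruction error is reweighted by optical-flow magnitude, so that dynamic regions contribute more to the loss than static background.

```python
import numpy as np

def flow_weighted_loss(pred, target, flow, eps=1e-6):
    """Hypothetical flow-based motion loss (illustrative, not the paper's exact form).

    pred, target: (T, H, W) predicted and ground-truth frames (grayscale for brevity).
    flow:         (T, H, W, 2) per-pixel displacement field between frames.
    """
    # Flow magnitude marks dynamic regions; static background has magnitude ~0.
    mag = np.linalg.norm(flow, axis=-1)           # (T, H, W)
    # Normalize to unit mean so the weights rescale, not inflate, the loss.
    weights = mag / (mag.mean() + eps)
    # Per-pixel squared reconstruction error, reweighted toward moving regions.
    err = (pred - target) ** 2
    return float((weights * err).mean())
```

With a uniform weight this would reduce to a plain L2 loss; concentrating the flow on moving pixels makes errors there dominate, which is the intended effect of focusing learning on task-relevant motion.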