World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure over time. To learn multi-view, cross-modality generation efficiently, we explicitly design cross-view and cross-modality feature fusion mechanisms that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action-optimization strategy that backpropagates through the generative model to infer the trajectory-level latent that best matches the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate, executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
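The back-project-and-fuse step mentioned above is standard pinhole geometry: each generated depth pixel is lifted into a camera-frame 3D point and transformed to a shared world frame, where clouds from different imagined viewpoints can be concatenated (and optionally voxelized). A minimal sketch, with hypothetical intrinsics and extrinsics chosen only for illustration:

```python
import numpy as np

# Toy camera: 4x4 depth image, hypothetical pinhole intrinsics.
H, W = 4, 4
fx = fy = 2.0
cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
depth = np.full((H, W), 1.5)  # constant 1.5 m depth, for illustration

# Back-project: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
u, v = np.meshgrid(np.arange(W), np.arange(H))
x = (u - cx) / fx * depth
y = (v - cy) / fy * depth
points_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Camera-to-world transform (assumed: identity rotation, 0.5 m translation).
R = np.eye(3)
t = np.array([0.0, 0.0, 0.5])
points_world = points_cam @ R.T + t

# Fusing views = repeating this per imagined viewpoint and concatenating
# the resulting world-frame clouds into one 3D structure.
print(points_world.shape)  # (16, 3)
```

Repeating this per generated view, and per predicted time step, yields the 4D (3D-plus-time) scene reconstruction the abstract describes.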
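The test-time action optimization can be illustrated with a toy frozen "world model": hold the model fixed, run gradient descent on the trajectory-level latent so the model's output matches the predicted future, then decode an action from that latent plus a residual correction. The linear map, dimensions, and residual head below are all hypothetical stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen toy "generative world model": latent z -> predicted future.
A = rng.normal(size=(16, 8))
def world_model(z):
    return A @ z

# The imagined future we want a latent to explain.
target = world_model(rng.normal(size=8))

# Test-time optimization: gradient descent on z through the frozen model,
# using the analytic gradient of ||A z - target||^2, i.e. 2 A^T (A z - target).
z = np.zeros(8)
lr = 1e-2
losses = []
for _ in range(200):
    resid = world_model(z) - target
    losses.append(float(resid @ resid))
    z -= lr * (2.0 * A.T @ resid)

# Residual inverse dynamics (toy): coarse action decoded from the latent
# plus a small learned correction; both maps here are placeholders.
W_base = rng.normal(size=(4, 8))
W_res = 0.1 * rng.normal(size=(4, 8))
action = W_base @ z + W_res @ z

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

In the actual method the world model is a deep generative network and backpropagation supplies the gradient automatically; the toy linear model just makes the optimize-the-latent, then-decode-with-a-residual structure concrete, and shows why the trajectory prior helps disambiguate the otherwise ill-posed inverse dynamics.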