Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To address these limitations, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning and pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy that improves long-horizon robustness. To enforce adherence to physical laws, we introduce a reinforcement learning post-training phase based on Group Relative Policy Optimization (GRPO) and guided by a novel Physical Foresight Coherence (PFC) reward. PFC uses the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments show that MIND-V achieves state-of-the-art performance in long-horizon simulation and provides significant value for policy learning, offering a scalable and fully autonomous framework for embodied data synthesis.
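The abstract does not give the PFC formula or the GRPO training details, so the following is only a minimal sketch of the general idea: score each sampled video by how well its latent features agree with a world model's predicted features, then convert the scores of one sample group into group-relative advantages. The function names, the use of cosine similarity as the agreement measure, and the population standard deviation are all assumptions made for illustration, not the authors' implementation. In practice the latent vectors would come from an encoder such as V-JEPA2; here they are plain lists of floats.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no agreement signal
    return dot / (norm_u * norm_v)


def pfc_style_reward(predicted_latents, observed_latents):
    """Hypothetical stand-in for a PFC-like reward (an assumption, not the paper's formula).

    predicted_latents: per-step features a world model predicts for the future.
    observed_latents:  per-step features encoded from the generated video.
    Returns the mean per-step cosine similarity, so dynamics that the world
    model predicts poorly receive a lower reward.
    """
    sims = [cosine_similarity(p, o)
            for p, o in zip(predicted_latents, observed_latents)]
    return mean(sims)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group of three sampled videos, each with two latent steps.
    predicted = [[1.0, 0.0], [0.0, 1.0]]
    group = [
        [[1.0, 0.0], [0.0, 1.0]],   # matches the prediction
        [[1.0, 1.0], [1.0, 1.0]],   # partially matches
        [[0.0, 1.0], [1.0, 0.0]],   # orthogonal to the prediction
    ]
    rewards = [pfc_style_reward(predicted, sample) for sample in group]
    print(group_relative_advantages(rewards))
```

Standardizing within the group means only the relative ranking of the sampled videos drives the policy update, which is why no separate value network appears in this sketch.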