World models are becoming central to robotic planning and control because they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural-language prediction, which are difficult to ground in robot actions and suffer from compounding errors over long horizons. Classical task and motion planning (TAMP), by contrast, models world transitions in logical space, enabling robust, robot-executable long-horizon reasoning; however, it typically operates independently of visual perception, preventing synchronized prediction of symbolic and visual states. We propose a Hierarchical World Model (H-WM) that jointly predicts logical and visual state transitions within a unified framework. H-WM combines a high-level logical world model with a low-level visual world model, integrating the long-horizon robustness of symbolic reasoning with visual grounding. The hierarchical outputs provide stable intermediate guidance for long-horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. Experiments across multiple vision-language-action (VLA) control policies demonstrate the effectiveness and generality of H-WM's guidance.
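To make the two-level structure concrete, the sketch below shows one plausible way such a hierarchy could be wired together: a deterministic logical transition in the style of TAMP add/delete effects, a visual predictor conditioned on the symbolic subgoal, and a joint rollout whose per-step (symbolic state, image) pairs serve as guidance for a VLA policy. This is a minimal illustration under our own assumptions; every class, method, and predicate encoding here (LogicalWorldModel, VisualWorldModel, rollout, ...) is hypothetical and not the paper's actual interface.

```python
# Hypothetical sketch of a hierarchical world model; names and interfaces
# are illustrative assumptions, not the H-WM implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    add: frozenset      # predicates the action makes true
    delete: frozenset   # predicates the action makes false

@dataclass(frozen=True)
class SymbolicState:
    predicates: frozenset  # e.g. {("on", "cube", "table"), ("gripper", "empty")}

class LogicalWorldModel:
    """High-level model: deterministic transition in logical space,
    in the spirit of TAMP add/delete effects."""
    def predict(self, state: SymbolicState, action: Action) -> SymbolicState:
        return SymbolicState((state.predicates - action.delete) | action.add)

class VisualWorldModel:
    """Low-level model: predicts the next visual observation, conditioned
    on the symbolic subgoal so the two state spaces stay synchronized."""
    def predict(self, image, subgoal: SymbolicState):
        # Placeholder: a real implementation would run a conditional
        # image/video prediction network here.
        return image

class HierarchicalWorldModel:
    """Joint rollout: the symbolic trajectory anchors the visual one,
    limiting compounding error over long horizons."""
    def __init__(self, logical: LogicalWorldModel, visual: VisualWorldModel):
        self.logical = logical
        self.visual = visual

    def rollout(self, sym_state: SymbolicState, image, plan):
        guidance = []  # per-step (symbolic state, predicted image) pairs
        for action in plan:
            sym_state = self.logical.predict(sym_state, action)
            image = self.visual.predict(image, subgoal=sym_state)
            guidance.append((sym_state, image))  # guidance for a VLA policy
        return guidance
```

Under this reading, the design choice is that each visual prediction is regenerated from the current symbolic state rather than chained purely on previous frames, which is one way the symbolic level could supply the stable intermediate guidance the abstract describes.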