World models are becoming central to robotic planning and control because they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural-language prediction, which are difficult to ground directly in robot actions and suffer from compounding errors over long horizons. Traditional task and motion planning relies on symbolic logic world models, such as planning domains, that are robot-executable and robust for long-horizon reasoning. However, these methods typically operate independently of visual perception, preventing synchronized prediction of symbolic and perceptual states. We propose a Hierarchical World Model (H-WM) that jointly predicts logical and visual state transitions within a unified bilevel framework. H-WM combines a high-level logical world model with a low-level visual world model, integrating the robot-executability and long-horizon robustness of symbolic reasoning with the perceptual grounding of visual observations. The hierarchical outputs provide stable, consistent intermediate guidance for long-horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. To train H-WM, we introduce a robotic dataset that aligns robot motion with symbolic states, actions, and visual observations. Experiments across vision-language-action (VLA) control policies demonstrate the effectiveness and generality of the approach.
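To make the bilevel prediction concrete, the following is a minimal, hypothetical sketch of a hierarchical rollout: a high-level logical model applies STRIPS-like add/delete effects to a symbolic state, and each predicted symbolic state then conditions a low-level visual prediction. All names (`SymbolicState`, `logical_step`, `visual_step`, the toy pick/place domain) are illustrative assumptions, not the actual H-WM interface, and the "visual" model is a placeholder where a learned image predictor would sit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SymbolicState:
    # A symbolic state as a set of logical facts,
    # e.g. {"on(block_a, table)", "gripper_empty"}.
    facts: frozenset

def logical_step(state: SymbolicState, action: str) -> SymbolicState:
    # High-level logical world model (toy domain): apply an action's
    # add/delete effects when its preconditions hold.
    if action == "pick(block_a)" and "gripper_empty" in state.facts:
        return SymbolicState(
            (state.facts - {"gripper_empty", "on(block_a, table)"})
            | {"holding(block_a)"}
        )
    if action == "place(block_a)" and "holding(block_a)" in state.facts:
        return SymbolicState(
            (state.facts - {"holding(block_a)"})
            | {"gripper_empty", "on(block_a, table)"}
        )
    return state  # preconditions unmet: state is unchanged

def visual_step(frames: list, sym_next: SymbolicState) -> list:
    # Placeholder for the low-level visual world model: here each "frame"
    # is just the sorted predicted facts; a real model would generate an
    # image conditioned on the predicted symbolic state.
    return frames + [sorted(sym_next.facts)]

def rollout(sym: SymbolicState, frames: list, plan: list):
    # Bilevel rollout: each symbolic transition guides the per-step
    # visual prediction, yielding consistent intermediate subgoals
    # instead of letting visual errors compound over the horizon.
    for action in plan:
        sym = logical_step(sym, action)
        frames = visual_step(frames, sym)
    return sym, frames
```

One design point worth noting: because the symbolic transition is computed first, the visual predictor only ever has to reach the next verified subgoal, which is how the hierarchy limits long-horizon error accumulation.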