The ability to plan with temporal abstractions is central to intelligent decision-making. Rather than reasoning over primitive actions, we study agents that compose pre-trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent policy alone can solve. Such compositional planning remains elusive because compounding errors in long-horizon predictions make it difficult to estimate the visitation distribution induced by sequencing policies. Motivated by the geometric policy composition framework introduced in arXiv:2206.08736, we address these challenges by learning, in an off-policy manner, predictive models of multi-step dynamics -- so-called jumpy world models -- that capture the state occupancies induced by pre-trained policies across multiple timescales. Building on Temporal Difference Flows (arXiv:2503.09817), we enhance these models with a novel consistency objective that aligns predictions across timescales, improving long-horizon predictive accuracy. We further demonstrate how to combine these generative predictions to estimate the value of executing arbitrary sequences of policies over varying timescales. Empirically, we find that compositional planning with jumpy world models significantly improves zero-shot performance across a wide range of base policies on challenging manipulation and navigation tasks, yielding, on average, a 200% relative improvement over planning with primitive actions on long-horizon tasks.
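To make the idea of estimating the value of a policy sequence from multi-step occupancies concrete, here is a minimal tabular sketch. It is not the paper's method: the learned generative jumpy world model is replaced by an exact discounted occupancy computed from known transition matrices. The sketch assumes a finite state space, a state-only reward r, and a geometric switching rule in which the agent follows a first policy and, after each step, hands over permanently to a second policy with probability 1 - alpha. All function names (occupancy_expectation, policy_value, composed_value, composed_value_reference) are hypothetical and introduced only for illustration.

```python
# Tabular sketch under stated assumptions: finite states, known dynamics.
# P[s][s2] is the state-to-state transition probability under one fixed policy.

def matvec(P, v):
    """Multiply a row-stochastic matrix P by a vector v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def occupancy_expectation(P, f, beta, iters=2000):
    """Return (M f)(s) for every s, where
    M = (1 - beta) * sum_{k>=0} beta^k P^(k+1)
    is the normalized discounted occupancy at timescale beta.
    Computed via the fixed point m = (1 - beta) P f + beta P m."""
    Pf = matvec(P, f)
    m = [0.0] * len(f)
    for _ in range(iters):
        Pm = matvec(P, m)
        m = [(1.0 - beta) * a + beta * b for a, b in zip(Pf, Pm)]
    return m

def policy_value(P, r, gamma):
    """V = sum_k gamma^k P^k r, written as r + gamma / (1 - gamma) * M_gamma r."""
    m = occupancy_expectation(P, r, gamma)
    return [ri + gamma / (1.0 - gamma) * mi for ri, mi in zip(r, m)]

def composed_value(P1, P2, r, gamma, alpha):
    """Value of following policy 1, then switching to policy 2 with
    probability (1 - alpha) after each step (assumed switching rule).
    Uses only the occupancy of policy 1 at timescale beta = gamma * alpha:
    V = r + M1_beta [beta * r + gamma * (1 - alpha) * V2] / (1 - beta)."""
    beta = gamma * alpha
    v2 = policy_value(P2, r, gamma)
    target = [beta * ri + gamma * (1.0 - alpha) * vi for ri, vi in zip(r, v2)]
    m = occupancy_expectation(P1, target, beta)
    return [ri + mi / (1.0 - beta) for ri, mi in zip(r, m)]

def composed_value_reference(P1, P2, r, gamma, alpha, iters=2000):
    """Step-by-step Bellman recursion for the same switching process,
    V = r + gamma * P1 (alpha * V + (1 - alpha) * V2), used as a cross-check."""
    v2 = policy_value(P2, r, gamma)
    v = [0.0] * len(r)
    for _ in range(iters):
        mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(v, v2)]
        v = [ri + gamma * x for ri, x in zip(r, matvec(P1, mix))]
    return v
```

The point of the sketch is that composed_value never rolls the dynamics forward one step at a time: a single expectation under the occupancy of the first policy, taken at the shortened timescale gamma * alpha, is enough to score the two-policy sequence. In the setting described above that expectation would presumably be approximated with samples from a learned model rather than computed exactly.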