Recent advances in world models have shown promise for modeling the future dynamics of environment states, enabling agents to reason and act without access to the real environment. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (\texttt{ITP}), a unified framework for agent learning via lookahead imagination, in which an agent's policy model interacts with a learned world model to produce multi-step ``imagined'' trajectories. Because the appropriate imagination horizon varies across tasks and stages, we introduce a novel adaptive lookahead mechanism that trades off the ultimate goal against task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts. These signals are fused with current observations, yielding a partially \textit{observable} and \textit{imaginable} Markov decision process that guides policy learning. We instantiate \texttt{ITP} with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that \texttt{ITP} significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead substantially enhances agents' reasoning capability, providing valuable insights into addressing broader, more complex tasks. Our code and data will be publicly available at https://github.com/loyiv/ITP.
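To make the lookahead idea concrete, the following is a minimal sketch of an adaptive-horizon imagined rollout on a toy one-dimensional task. It is not the \texttt{ITP} implementation: the names `imagine`, `fuse`, `greedy_policy`, and `make_world_model`, the distance-based notion of progress, and the stopping rule (stop when the goal is reached in imagination, when a conflict is predicted, or when progress stalls for `patience` steps) are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

State = int   # toy state: a position on a line (assumption for illustration)
Action = int  # toy action: -1, 0, or +1


@dataclass
class Imagination:
    states: List[State]    # imagined states, excluding the current one
    actions: List[Action]  # actions the policy proposed during imagination
    progress: float        # imagined reduction in distance to the goal
    conflict: bool         # True if the world model predicted a blocked move


def greedy_policy(state: State, goal: State) -> Action:
    """Hypothetical policy: step toward the goal."""
    return (goal > state) - (goal < state)


def make_world_model(walls: Set[State]) -> Callable[[State, Action], Tuple[State, bool]]:
    """Hypothetical learned world model: predicts (next_state, blocked)."""
    def world_model(state: State, action: Action) -> Tuple[State, bool]:
        nxt = state + action
        if nxt in walls:
            return state, True   # move is predicted to fail
        return nxt, False
    return world_model


def imagine(state, goal, policy, world_model, max_horizon=8, patience=2) -> Imagination:
    """Roll the policy forward inside the world model with an adaptive horizon."""
    states, actions = [], []
    conflict, stalled = False, 0
    current, best = state, abs(goal - state)
    for _ in range(max_horizon):
        action = policy(current, goal)
        nxt, blocked = world_model(current, action)
        states.append(nxt)
        actions.append(action)
        if blocked:              # potential conflict: stop imagining
            conflict = True
            break
        dist = abs(goal - nxt)
        if dist < best:          # imagined progress: keep looking ahead
            best, stalled = dist, 0
        else:                    # no progress: count toward early stop
            stalled += 1
        if best == 0 or stalled >= patience:
            break
        current = nxt
    return Imagination(states, actions, float(abs(goal - state) - best), conflict)


def fuse(observation: State, imagination: Imagination) -> Dict[str, object]:
    """Combine the current observation with signals from the imagined rollout."""
    return {
        "observation": observation,
        "imagined_progress": imagination.progress,
        "imagined_conflict": imagination.conflict,
        "horizon_used": len(imagination.states),
    }


# Usage: imagine from position 0 toward goal 3 with a predicted wall at 2.
rollout = imagine(0, 3, greedy_policy, make_world_model({2}))
augmented_input = fuse(0, rollout)
```

In this sketch the horizon is not fixed in advance: the rollout ends early when the imagined trajectory reaches the goal or predicts a conflict, and otherwise runs up to `max_horizon`. The fused dictionary stands in for the augmented input a policy would condition on.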