Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (invocation rates below 1%), frequently misuse predicted rollouts (approximately 15% of invocations), and often exhibit inconsistent or even degraded performance (drops of up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
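To make the setup concrete, the following minimal Python sketch shows one way a generative world model could be exposed to a VLM agent as a callable simulation tool. It illustrates the interaction pattern described above, not the paper's actual implementation; every name here (WorldModel, rollout, Agent, score) is a hypothetical placeholder.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    image: bytes       # current visual observation
    instruction: str   # natural-language task instruction

class WorldModel:
    """Hypothetical generative simulator: predicts future observations."""
    def rollout(self, obs: Observation, actions: List[str]) -> List[Observation]:
        # A real model would generate predicted future frames; stubbed here.
        raise NotImplementedError

class Agent:
    """VLM agent that may optionally consult the world model before acting."""
    def __init__(self, world_model: Optional[WorldModel] = None):
        self.world_model = world_model

    def act(self, obs: Observation, candidates: List[str]) -> str:
        if self.world_model is not None:
            # Anticipatory path: simulate each candidate action and pick
            # the one whose predicted outcome scores highest.
            scored = [(self.score(obs, self.world_model.rollout(obs, [a])), a)
                      for a in candidates]
            return max(scored)[1]
        # Fallback path: short-horizon reasoning directly from the observation.
        return candidates[0]

    def score(self, obs: Observation, predicted: List[Observation]) -> float:
        # Placeholder: a real agent would query the VLM to judge whether
        # the predicted outcome advances the task.
        return 0.0

The failure modes reported above map onto this loop: deciding whether to take the anticipatory path at all (when to simulate), scoring the rollout (how to interpret predicted outcomes), and acting on the winner (how to integrate foresight into downstream reasoning).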