Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulation. To systematically investigate whether LLMs are suitable for world modeling, we probe two core capabilities of world models, future-state prediction and reward estimation, through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance degrades rapidly in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations in factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves relative improvements of up to 23.4% and 16.3% on subsets of OSWorld and WebArena, respectively, compared to baselines, with particular advantages in longer-horizon simulations.