Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks, but they are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states and thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses, and a novel benchmark to track future progress as new models appear.
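To make the task concrete, the following is a minimal sketch of what a single state-transition example and an exact-match scoring check might look like. The field names and state representation here are illustrative assumptions, not the actual ByteSized32-State-Prediction schema, and the mock prediction stands in for a real LLM call.

```python
# Hypothetical state-transition example: a world state, an action, and the
# gold next state. The schema below is illustrative, not the benchmark's
# actual format.
example = {
    "state": {"door": "closed", "player_location": "kitchen"},
    "action": "open door",
    "next_state": {"door": "open", "player_location": "kitchen"},
}

def score_prediction(predicted: dict, gold: dict) -> bool:
    """Exact-match check: the predicted next state must reproduce every
    object property in the gold next state."""
    return predicted == gold

# A model acting as a world simulator would be prompted with `state` and
# `action` and asked to emit `next_state`; here a mock output stands in
# for the model's response.
mock_llm_output = {"door": "open", "player_location": "kitchen"}
print(score_prediction(mock_llm_output, example["next_state"]))  # True
```

Exact match over whole states is the strictest scoring choice; a real evaluation could also score per-property accuracy to credit partially correct predictions.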