Language Models Meet World Models: Embodied Experiences Enhance Language Models

While large language models (LMs) have shown remarkable capabilities across numerous tasks, they often struggle with simple reasoning and planning in physical environments, such as understanding object permanence or planning household activities. This limitation arises because LMs are trained only on written text and lack essential embodied knowledge and skills. In this paper, we propose a new paradigm of enhancing LMs by finetuning them with world models, so that they gain diverse embodied knowledge while retaining their general language capabilities. Our approach deploys an embodied agent in a world model, specifically a simulator of the physical world (VirtualHome), and acquires a diverse set of embodied experiences through both goal-oriented planning and random exploration. These experiences are then used to finetune LMs, teaching them diverse abilities of reasoning and acting in the physical world, such as planning and completing goals, object permanence, and object tracking. Moreover, it is desirable to preserve the generality of LMs during finetuning, which helps the embodied knowledge generalize across tasks rather than remain tied to specific simulations. We thus further introduce the classical elastic weight consolidation (EWC) for selective weight updates, combined with low-rank adapters (LoRA) for training efficiency. Extensive experiments show our approach substantially improves base LMs on 18 downstream tasks by 64.28% on average. In particular, the small LMs (1.3B, 6B, and 13B) enhanced by our approach match or even outperform much larger LMs (e.g., ChatGPT).
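To illustrate what "selective weight updates" means for EWC, the following is a minimal sketch of the standard EWC penalty, which adds a quadratic term (λ/2) Σ_i F_i (θ_i − θ*_i)² to the task loss so that parameters with high Fisher information stay close to their pretrained values. It is not the paper's implementation: the function names, the flat parameter lists, and the example Fisher values are assumptions made for illustration, and how the penalty is combined with LoRA is not specified in the abstract.

```python
# Hypothetical, framework-free sketch of the standard EWC penalty.
# All names and values are illustrative assumptions, not the paper's code.

def ewc_penalty(params, anchor_params, fisher, lam):
    """Return (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def ewc_penalty_grad(params, anchor_params, fisher, lam):
    """Gradient of the penalty with respect to each parameter."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, fisher)]


def ewc_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """Task loss plus the EWC regularizer."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)


# Toy example: the first weight is "important" (high Fisher value),
# the second is not (zero Fisher value), so only the first is penalized.
anchor = [1.0, 2.0]   # pretrained weights theta*
current = [1.5, 2.5]  # weights after some finetuning steps
fisher = [4.0, 0.0]   # assumed diagonal Fisher information estimates

penalty = ewc_penalty(current, anchor, fisher, lam=2.0)
grad = ewc_penalty_grad(current, anchor, fisher, lam=2.0)
total = ewc_loss(0.5, current, anchor, fisher, lam=2.0)
```

In this toy case both weights moved by the same amount, but only the first contributes to the penalty and its gradient, which is the sense in which the update is selective.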
