Recent studies have used Large Language Models (LLMs) to assist decision-making and planning in interactive environments and have sought to align the LLMs' knowledge with the conditions of the world. Nonetheless, it remains unclear whether LLMs can continuously acquire environmental knowledge and adapt in an open world. In this paper, we propose an approach that encourages LLMs to explore the open world, gather experiences, and learn to improve their task-solving capabilities. The approach uses a multi-round feedback-revision mechanism in which the LLM actively selects appropriate revision actions, guided by feedback from the environment. This facilitates exploration and improves the model's performance. In addition, we integrate sub-task relabeling to help the LLM keep its sub-task planning consistent and to help it learn the combinatorial nature of tasks, so that training on the collected exploration experiences enables it to complete a wider range of tasks. Through evaluation in Minecraft, an open-ended sandbox world, we demonstrate that our approach, LLaMA-Rider, makes the LLM more efficient at exploring the environment and, after fine-tuning on merely 1.3k instances of collected data, effectively improves its ability to accomplish more tasks, at a training cost that is small compared with the reinforcement learning baseline.
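To make the two mechanisms named above concrete, the following is a minimal sketch of a multi-round feedback-revision loop followed by a simple form of sub-task relabeling. It is an illustration under stated assumptions, not the LLaMA-Rider implementation: the function names (`feedback_revision`, `relabel_subtasks`), the toy crafting rules in `REQUIRES`, and the record layout are all hypothetical, and the LLM is replaced by hand-written `toy_propose` and `toy_revise` stand-ins.

```python
from typing import Callable, Dict, List, Optional, Tuple

Plan = List[str]

# Hypothetical toy world: each item lists the items that must be obtained first.
REQUIRES: Dict[str, List[str]] = {"stick": ["planks"], "planks": ["log"], "log": []}


def feedback_revision(
    task: str,
    propose: Callable[[str], Plan],
    revise: Callable[[str, Plan, str], Plan],
    execute: Callable[[Plan], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[Plan], int]:
    """Try a plan, and on failure revise it using environment feedback.

    Returns (plan, revisions_used) on success, or (None, max_rounds) if the
    plan still fails after max_rounds revisions.
    """
    plan = propose(task)
    for round_idx in range(max_rounds + 1):
        ok, feedback = execute(plan)
        if ok:
            return plan, round_idx
        if round_idx < max_rounds:
            plan = revise(task, plan, feedback)
    return None, max_rounds


def relabel_subtasks(plan: Plan) -> List[dict]:
    """Label each step with the sub-task it completes, not the top-level task.

    Assumption: one training record per step, holding the sub-task name and
    the steps already completed before it.
    """
    return [{"task": step, "done": plan[:i]} for i, step in enumerate(plan)]


# --- Hand-written stand-ins for the LLM and the environment (assumptions) ---
def toy_propose(task: str) -> Plan:
    return [task]  # naive first attempt: go straight for the goal


def toy_execute(plan: Plan) -> Tuple[bool, str]:
    have = set()
    for step in plan:
        for req in REQUIRES.get(step, []):
            if req not in have:
                return False, req  # feedback names the missing prerequisite
        have.add(step)
    return True, ""


def toy_revise(task: str, plan: Plan, feedback: str) -> Plan:
    return [feedback] + plan  # obtain the missing item first


plan, rounds = feedback_revision("stick", toy_propose, toy_revise, toy_execute)
records = relabel_subtasks(plan) if plan is not None else []
```

In this toy setting the loop recovers the full prerequisite chain from feedback alone, and each relabeled record treats a prefix of the successful plan as an experience for its own sub-task, which is one simple way to read the claim that relabeling exposes how tasks compose.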