This study focuses on using large language models (LLMs) as planners for embodied agents that follow natural language instructions to complete complex tasks in a visually perceived environment. The high data cost and poor sample efficiency of existing methods hinder the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of LLMs to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding, so that they generate and update plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance: despite using less than 0.5% of the paired training data, LLM-Planner performs competitively with recent baselines that are trained on the full training data, whereas existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door to developing versatile and sample-efficient embodied agents that can quickly learn many tasks. Website: https://dki-lab.github.io/LLM-Planner