This study focuses on using large language models (LLMs) as planners for embodied agents that follow natural language instructions to complete complex tasks in a visually perceived environment. The high data cost and poor sample efficiency of existing methods hinder the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding, so that they generate and update plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance: despite using less than 0.5% of the paired training data, LLM-Planner performs competitively with recent baselines that are trained on the full training data. Under the same few-shot setting, existing methods can barely complete any task successfully. Our work opens the door to developing versatile and sample-efficient embodied agents that can quickly learn many tasks. Website: https://dki-lab.github.io/LLM-Planner