Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization.
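The automated debugging step described above can be pictured as a validate-and-re-prompt loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the toy number-line domain, the `get_plan` entry point, the `validate` and `synthesize_with_debugging` helpers, and the stubbed `stub_llm` are all hypothetical stand-ins for a PDDL domain and a GPT-4 call. The feedback messages are illustrative only, since the abstract does not name the four feedback types.

```python
from typing import Callable, List, Optional, Tuple

Task = Tuple[int, int]  # hypothetical toy task: (start, goal) on a number line


def validate(program_src: str, tasks: List[Task]) -> Optional[str]:
    """Run a candidate program on the training tasks.

    Returns None if every task is solved, otherwise a feedback string
    describing the first failure (illustrative categories only).
    """
    namespace: dict = {}
    try:
        exec(program_src, namespace)  # load the synthesized program
        get_plan = namespace["get_plan"]
    except Exception as exc:
        return f"program failed to load: {exc!r}"
    for start, goal in tasks:
        try:
            plan = get_plan(start, goal)
        except Exception as exc:
            return f"exception on task {(start, goal)}: {exc!r}"
        pos = start
        for action in plan:  # simulate the plan in the toy domain
            if action not in ("inc", "dec"):
                return f"invalid action {action!r} on task {(start, goal)}"
            pos += 1 if action == "inc" else -1
        if pos != goal:
            return f"plan ended at {pos}, not goal {goal}, on task {(start, goal)}"
    return None


def synthesize_with_debugging(
    propose: Callable[[Optional[str]], str],
    tasks: List[Task],
    max_attempts: int = 4,
) -> Tuple[Optional[str], int]:
    """Ask `propose` for a program and re-prompt with feedback until it validates."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        program_src = propose(feedback)
        feedback = validate(program_src, tasks)
        if feedback is None:
            return program_src, attempt
    return None, max_attempts


# Stub standing in for the LLM: the first answer only handles goal >= start.
BUGGY = "def get_plan(start, goal):\n    return ['inc'] * (goal - start)\n"
FIXED = (
    "def get_plan(start, goal):\n"
    "    step = 'inc' if goal >= start else 'dec'\n"
    "    return [step] * abs(goal - start)\n"
)


def stub_llm(feedback: Optional[str]) -> str:
    return BUGGY if feedback is None else FIXED


if __name__ == "__main__":
    program, attempts = synthesize_with_debugging(stub_llm, [(0, 3), (5, 2)])
    print(attempts)
    print(program)
```

In this sketch, two training tasks are enough to expose the bug in the first candidate, because the second task needs a move in the opposite direction. A real setup would replace `stub_llm` with a model call and the number-line simulator with a PDDL plan validator.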