Intrigued by claims of emergent reasoning capabilities in LLMs trained on general web corpora, we set out in this paper to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves at generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are as a source of heuristic guidance for other agents, either AI planners or human planners, in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about a 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by the research community.