LLMs are increasingly being used for planning-style tasks, but their planning and reasoning capabilities are poorly understood. We present AutoPlanBench, a novel method for automatically converting planning benchmarks written in PDDL into textual descriptions, and offer a benchmark dataset created with our method. We show that while the best LLM planners do well on some planning tasks, others remain out of reach of current methods.