There is a growing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on the core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language, which allows us to synthesize problems with provably correct solutions across many tasks and domains. It also affords us scale without additional human effort: many additional problems can be generated automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights a significant gap in the reasoning capabilities of these models. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains on multiple-choice questions, yet, surprisingly, no notable progress on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.
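To illustrate how a formal domain description yields provably correct labels, here is a minimal Python sketch. It is not the ACPBench generation pipeline; the STRIPS-style representation and all names (`Action`, `applicable`, the Blocksworld-like atoms) are illustrative assumptions. The point is only that the answer to a boolean question such as action applicability follows mechanically from the formal semantics, so questions can be synthesized at scale without hand-labeling.

```python
# Minimal sketch (hypothetical, not the ACPBench generator): derive a boolean
# "is this action applicable?" question whose ground-truth answer is computed
# from STRIPS semantics rather than labeled by hand.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # atoms that must hold in the current state
    add_effects: frozenset    # atoms made true by applying the action
    del_effects: frozenset    # atoms made false by applying the action

def applicable(action: Action, state: frozenset) -> bool:
    # STRIPS applicability: every precondition holds in the current state.
    return action.preconditions <= state

def make_boolean_question(action: Action, state: frozenset) -> dict:
    # The label follows from the formal semantics, so it is provably correct.
    return {
        "question": f"In the current state, is the action '{action.name}' applicable?",
        "answer": applicable(action, state),
    }

# Toy Blocksworld-style instance (atom names are illustrative).
state = frozenset({"clear(a)", "ontable(a)", "handempty"})
pickup_a = Action(
    name="pickup(a)",
    preconditions=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    add_effects=frozenset({"holding(a)"}),
    del_effects=frozenset({"clear(a)", "ontable(a)", "handempty"}),
)
print(make_boolean_question(pickup_a, state))  # answer: True
```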