Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the Planning Domain Definition Language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code remains a significant challenge. First, generated PDDL code is typically evaluated with planning validators that only check whether the problem can be solved by a planner. This is insufficient because a language model may generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often contain natural language descriptions that closely resemble the ground-truth PDDL, reducing the difficulty of the task. To bridge this gap, we introduce \benchmarkName, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We first develop a PDDL equivalence algorithm that rigorously assesses the correctness of generated PDDL code by flexibly comparing it against a ground-truth PDDL. We then present a dataset of $132,037$ text-to-PDDL pairs spanning 13 tasks of varying difficulty. Finally, we evaluate several API-access and open-weight language models, revealing the complexity of this task. For example, $87.6\%$ of the PDDL problem descriptions generated by GPT-4o are syntactically parseable and $82.2\%$ are valid, solvable problems, but only $35.1\%$ are semantically correct, highlighting the need for a more rigorous benchmark for this problem.