Automating the translation of natural-language specifications into logic programs is a challenging task at the heart of neurosymbolic engineering. We present ASP-Bench, a benchmark of 128 natural-language problem instances: 64 base problems, each with an easy and a hard variant. The benchmark evaluates systems that translate natural-language problems into Answer Set Programs (ASPs), a prominent form of logic programming, and provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes reference validators that check whether a candidate solution satisfies the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark with an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that iterative refinement driven by solver feedback is a reliable and robust approach to modeling natural language in ASP. Our analysis across multiple agent runs yields insights into what determines a problem's modeling hardness.
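To make the covered ASP features concrete, the following is a minimal clingo-style sketch, using a hypothetical toy packing instance that is not taken from the benchmark, combining all three constructs named above: a choice rule, an aggregate constraint, and an optimization directive.

```
% Hypothetical toy instance: items with weights and values.
item(a). item(b). item(c).
weight(a,4). weight(b,3). weight(c,6).
value(a,5).  value(b,4).  value(c,7).

% Choice rule: freely select any subset of items to pack.
{ pack(I) : item(I) }.

% Aggregate constraint: the total packed weight may not exceed 8.
:- #sum { W,I : pack(I), weight(I,W) } > 8.

% Optimization: maximize the total value of the packed items.
#maximize { V,I : pack(I), value(I,V) }.

#show pack/1.
```

Running a program like this through a solver such as clingo yields optimal stable models (here, the selected `pack/1` atoms), and the solver's error and cost feedback is the kind of signal the iterative refinement loop described above can exploit.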