DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems

From arXiv: This version has been submitted and is currently under review. For CP-Bench (the paper accepted at ECAI 2025), please refer to the previous version of this entry (v2).

Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
