ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question: can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution, which a deterministic verifier checks against every constraint and against the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% feasibility, yet feasible solutions average 89% to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on the joint criterion of feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility ranging from 85.0% in the facility location domain to 0.8% in the crew assignment domain. Systematic failure modes include misunderstanding of duration constraints, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing, where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.
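To make the evaluation criteria concrete, the following is a minimal sketch of how a deterministic verifier of this kind could work. It is an illustration under stated assumptions, not the ConstraintBench implementation: the solution format, the names Constraint and verify, and the toy knapsack task are all hypothetical, and the reference optimum is hard-coded here rather than obtained from Gurobi. The sketch assumes a maximization objective with a nonzero optimum and uses the 0.1% relative tolerance mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical solution format: item name -> 0/1 selection flag.
Solution = Dict[str, int]


@dataclass
class Constraint:
    name: str
    check: Callable[[Solution], bool]  # returns True when satisfied


def verify(solution, constraints, objective, optimum, rel_tol=1e-3):
    """Check every constraint, then compare the objective to the reference.

    Assumes maximization and a nonzero reference optimum.
    """
    violated = [c.name for c in constraints if not c.check(solution)]
    feasible = not violated
    # The objective is only meaningful (and only evaluated) when feasible.
    value = objective(solution) if feasible else None
    gap = (optimum - value) / abs(optimum) if feasible else None
    return {
        "feasible": feasible,
        "violated": violated,
        "value": value,
        "gap": gap,
        "optimal": feasible and gap <= rel_tol,
    }


# Toy knapsack task (hypothetical data); the optimum 9 is {a, b}.
WEIGHTS = {"a": 3, "b": 4, "c": 5}
VALUES = {"a": 4, "b": 5, "c": 6}
CAPACITY = 7
OPTIMUM = 9

CONSTRAINTS = [
    # Rejects entities that do not exist in the task (entity hallucination).
    Constraint("known_entities", lambda s: all(k in WEIGHTS for k in s)),
    Constraint("binary", lambda s: all(v in (0, 1) for v in s.values())),
    Constraint(
        "capacity",
        lambda s: sum(WEIGHTS.get(k, 0) * v for k, v in s.items()) <= CAPACITY,
    ),
]


def total_value(s):
    return sum(VALUES[k] * v for k, v in s.items())


best = verify({"a": 1, "b": 1}, CONSTRAINTS, total_value, OPTIMUM)
suboptimal = verify({"c": 1}, CONSTRAINTS, total_value, OPTIMUM)
overweight = verify({"a": 1, "c": 1}, CONSTRAINTS, total_value, OPTIMUM)
hallucinated = verify({"a": 1, "d": 1}, CONSTRAINTS, total_value, OPTIMUM)
```

In this sketch, a solution counts toward the joint metric only when no constraint is violated and its relative gap to the reference is within the tolerance. A feasible but suboptimal answer such as selecting only item c passes the feasibility check while failing the optimality check, which mirrors the separation between the two metrics reported above.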