We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator that creates instances which remain challenging as LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or rely on LLM-as-a-judge to check solutions, MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult, automatically verifiable instances. We release MathConstraint-Easy ($266$ instances), on which frontier models achieve between $72.6\%$ (gemini-3.1-flash-lite) and $87.6\%$ (gpt-5.5) accuracy, and MathConstraint ($329$ instances), on which the same models drop to between $18.5\%$ (claude-4.6-sonnet) and $66.9\%$ (gpt-5.5) accuracy, demonstrating the resilience of our benchmark generator against rapid progress in LLM reasoning capabilities. We evaluate 12 frontier and open-weight models with and without access to a sandboxed Python environment that includes generic SAT/SMT solvers. Tool access roughly doubles frontier accuracy on MathConstraint (mean $+28$pp; up to $+52$pp for claude-4.6-sonnet). Further, halving the tool-call budget from $8$ to $4$ rounds erases up to $37$pp -- a sensitivity that most single-budget benchmarks miss. We release the generator, dataset, and evaluation harness as a robust environment for studying combinatorial reasoning and tool-use behavior under adversarially tunable difficulty.
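To make the idea of parameterized, automatically verifiable instances concrete, the following is a minimal sketch and not the released generator. It assumes graph coloring as a stand-in problem type; the names `generate_instance`, `verify`, and `brute_force_solve`, and the parameters `n_vertices`, `edge_prob`, and `n_colors`, are hypothetical and chosen only for illustration.

```python
import itertools
import random


def generate_instance(n_vertices, edge_prob, n_colors, seed):
    """Hypothetical parameterized generator: a random graph-coloring CSP.

    Difficulty is tuned through n_vertices, edge_prob, and n_colors.
    """
    rng = random.Random(seed)
    edges = [
        (u, v)
        for u, v in itertools.combinations(range(n_vertices), 2)
        if rng.random() < edge_prob
    ]
    return {"n_vertices": n_vertices, "n_colors": n_colors, "edges": edges}


def verify(instance, coloring):
    """Check a candidate answer against every constraint (no judge model)."""
    if len(coloring) != instance["n_vertices"]:
        return False
    if any(c not in range(instance["n_colors"]) for c in coloring):
        return False
    # Adjacent vertices must receive different colors.
    return all(coloring[u] != coloring[v] for u, v in instance["edges"])


def brute_force_solve(instance):
    """Exhaustive reference solver; only feasible for tiny instances."""
    colors = range(instance["n_colors"])
    for coloring in itertools.product(colors, repeat=instance["n_vertices"]):
        if verify(instance, coloring):
            return list(coloring)
    return None  # the instance is unsatisfiable
```

In this sketch, raising `n_vertices` or lowering `n_colors` makes instances harder while `verify` stays unchanged, which is the property that lets correctness be checked mechanically regardless of how an answer was produced.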