While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge this gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a decision-making scenario described in natural language, the model must output a discrete solution directly, without writing code or calling external solvers. NLCO covers 43 CO problems, organized by a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes that enables fine-grained evaluation. We provide solver-annotated reference solutions and comprehensively evaluate LLMs on feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even when more tokens are spent on reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.