Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (\IIS{}), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation -- given a problem description, generate solver code -- ignoring this diagnostic loop entirely. We introduce two benchmarks that place the \textbf{solver in the evaluation loop}. \textbf{\ORDebug{}} evaluates iterative self-correction on 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and \IIS{} recomputation, providing deterministic, verifiable feedback. \textbf{\ORBias{}} evaluates behavioral rationality on 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3\% vs 86.2\% recovery rate (+9.1 points), 62.4\% vs 47.8\% diagnostic accuracy (+14.6 points), and 2.25 vs 3.78 steps to resolution (1.7$\times$ faster). On \ORBias{}, curriculum training achieves the only negative ID$\rightarrow$OOD bias drift among all evaluated models (-9.6\%), reducing systematic bias by 48\% (from 20.0\% to 10.4\%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
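The closed-form optimal policy that \ORBias{} measures deviations against is the classical newsvendor critical-fractile solution. A minimal sketch, assuming normally distributed demand; the function name and the price, cost, and demand parameters below are illustrative choices, not values from the benchmark.

```python
# Sketch of the closed-form newsvendor policy: order q* = F^{-1}(cu / (cu + co)),
# where cu is the per-unit underage cost and co the per-unit overage cost.
# All parameter values here are hypothetical illustration values.
from statistics import NormalDist

def newsvendor_optimal_quantity(price, cost, salvage, mu, sigma):
    """Critical-fractile order quantity for Normal(mu, sigma) demand."""
    cu = price - cost          # underage cost: margin lost per unit of unmet demand
    co = cost - salvage        # overage cost: loss per unsold unit
    fractile = cu / (cu + co)  # optimal service level in (0, 1)
    return NormalDist(mu, sigma).inv_cdf(fractile)

# Example: price 10, cost 4, no salvage value, demand ~ Normal(100, 20);
# the critical fractile is 6 / (6 + 4) = 0.6, so q* sits slightly above the mean.
q_star = newsvendor_optimal_quantity(price=10, cost=4, salvage=0, mu=100, sigma=20)
```

A model exhibiting the pull-to-center bias studied in behavioral OR would systematically order between `mu` and `q_star`; comparing reported quantities against this closed form is what makes the oracle verifiable.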