Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, a framework that addresses silent failures from two complementary directions. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable-type reasoning and self-verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation, without requiring ground truth -- an external semantic signal that bypasses the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS-enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt-190, a benchmark of 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
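To make the behavioral-verification idea concrete, the following is a minimal sketch (not ReLoop's actual implementation) of ground-truth-free perturbation testing: a candidate formulation is re-solved under a perturbed parameter, and the check asserts that the objective moves in the semantically expected direction -- here, relaxing a capacity constraint should never worsen a maximization objective. The toy LP, the `solve` helper, and the tolerance value are illustrative assumptions.

```python
# Toy behavioral check: relax a capacity parameter and verify the
# objective responds monotonically. Uses scipy's linprog as the solver.
from scipy.optimize import linprog

def solve(capacity: float) -> float:
    # maximize 3x + 2y  subject to  x + y <= capacity, x >= 0, y >= 0
    # (linprog minimizes, so we negate the objective coefficients)
    res = linprog(c=[-3.0, -2.0],
                  A_ub=[[1.0, 1.0]],
                  b_ub=[capacity],
                  bounds=[(0, None), (0, None)])
    assert res.status == 0, "solver failed"
    return -res.fun  # objective value of the maximization problem

base = solve(10.0)     # objective at the nominal capacity
relaxed = solve(12.0)  # objective after relaxing the constraint

# A semantically correct formulation cannot get worse when a binding
# resource constraint is loosened; a violation flags a silent failure.
assert relaxed >= base - 1e-9, "formulation fails behavioral check"
print(f"base={base:.1f}, relaxed={relaxed:.1f}: behavioral check passed")
```

The same pattern generalizes: each perturbation encodes one expected sensitivity of the model (monotonicity, zero-sensitivity of irrelevant parameters), giving an external semantic signal without any reference solution.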