Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations -- a feasibility-correctness gap reaching 90 percentage points on compositional problems. We introduce ReLoop, which addresses this gap through two complementary mechanisms. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify), preventing formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation -- an external semantic signal that bypasses LLM self-review and requires no ground truth. The two mechanisms are complementary by error structure: structured generation drives the largest gains on compositional problems (+8.5pp accuracy on RetailOpt-190 with Claude Opus 4.6), while behavioral verification dominates on localized defects (+4.4pp on MAMO-ComplexLP, its largest contribution across benchmarks). Combined with diagnostic execution recovery, ReLoop reaches 100% executable code on Claude Opus 4.6 and consistently improves accuracy on chat-tuned foundation models across three benchmarks. We further identify a limitation of narrowly-tuned SFT models: their learned output formats are brittle to chain-of-thought prompts, an interaction we document and analyze. We release RetailOpt-190, a benchmark of 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
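The behavioral-verification idea above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names (`solve`, `behavioral_check`) and the toy one-variable model are hypothetical stand-ins, and a real deployment would re-invoke an LP/MIP solver on the perturbed model. The sketch shows the core check: perturb one parameter, re-solve, and verify the objective moves in the economically expected direction, using the solver itself as an external oracle with no ground-truth answer required.

```python
# Hypothetical sketch of solver-based behavioral verification.
# All names here are illustrative, not ReLoop's actual API.

def solve(params):
    # Toy stand-in for a generated optimization model:
    # maximize price * x subject to 0 <= x <= capacity.
    # With price > 0 the optimum is x = capacity, so the
    # optimal objective value is price * capacity.
    return params["price"] * params["capacity"]

def behavioral_check(solve_fn, params, key, delta, expect):
    """Perturb params[key] by delta, re-solve, and compare objectives.

    expect is 'increase' or 'decrease': the direction a semantically
    correct formulation must move. Weak inequalities allow for
    non-binding constraints, where the objective may stay unchanged.
    """
    base = solve_fn(params)
    perturbed = dict(params, **{key: params[key] + delta})
    new = solve_fn(perturbed)
    if expect == "increase":
        return new >= base
    return new <= base

params = {"price": 3.0, "capacity": 10.0}
# Relaxing a binding capacity constraint must not reduce the optimum;
# a formulation that fails this test encodes the wrong semantics even
# though it executed and returned a feasible solution.
ok = behavioral_check(solve, params, "capacity", 1.0, "increase")
```

A correct formulation passes such directional checks; a silently wrong one (e.g., a constraint with a flipped sign) typically responds to the perturbation in the wrong direction, which is what makes this signal independent of LLM self-review.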