Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations -- a feasibility-correctness gap reaching 90 percentage points on compositional problems. We introduce ReLoop, which addresses this gap through two complementary mechanisms. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify), preventing formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation -- an external semantic signal that bypasses LLM self-review and requires no ground truth. The two mechanisms are complementary by error structure: structured generation drives the largest gains on compositional problems (+8.5pp accuracy on RetailOpt-190 with Claude Opus 4.6), while behavioral verification dominates on localized defects (+4.4pp on MAMO-ComplexLP, its largest contribution across benchmarks). Combined with diagnostic execution recovery, ReLoop reaches 100% executable code on Claude Opus 4.6 and consistently improves accuracy on chat-tuned foundation models across three benchmarks. We further identify a limitation of narrowly-tuned SFT models: their learned output formats are brittle to chain-of-thought prompts, an interaction we document and analyze. We release RetailOpt-190, a benchmark of 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
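The behavioral-verification idea above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names (`solve`, `behavioral_check`) and the toy one-variable model are hypothetical stand-ins, and a real deployment would re-invoke an LP/MIP solver on the perturbed model. The sketch shows the core check: perturb one parameter, re-solve, and verify the objective moves in the economically expected direction, using the solver itself as an external oracle with no ground-truth answer required.

```python
# Hypothetical sketch of solver-based behavioral verification.
# All names here are illustrative, not ReLoop's actual API.

def solve(params):
    # Toy stand-in for a generated optimization model:
    # maximize price * x subject to 0 <= x <= capacity.
    # With price > 0 the optimum is x = capacity, so the
    # optimal objective value is price * capacity.
    return params["price"] * params["capacity"]

def behavioral_check(solve_fn, params, key, delta, expect):
    """Perturb params[key] by delta, re-solve, and compare objectives.

    expect is 'increase' or 'decrease': the direction a semantically
    correct formulation must move. Weak inequalities allow for
    non-binding constraints, where the objective may stay unchanged.
    """
    base = solve_fn(params)
    perturbed = dict(params, **{key: params[key] + delta})
    new = solve_fn(perturbed)
    if expect == "increase":
        return new >= base
    return new <= base

params = {"price": 3.0, "capacity": 10.0}
# Relaxing a binding capacity constraint must not reduce the optimum;
# a formulation that fails this test encodes the wrong semantics even
# though it executed and returned a feasible solution.
ok = behavioral_check(solve, params, "capacity", 1.0, "increase")
```

A correct formulation passes such directional checks; a silently wrong one (e.g., a constraint with a flipped sign) typically responds to the perturbation in the wrong direction, which is what makes this signal independent of LLM self-review.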