Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs; in particular, we show that direct fine-tuning on traditional \textit{non-reflective} datasets yields only limited gains. To fully leverage the inherent reasoning abilities of LRMs, we propose \textbf{CALM} (\textit{Corrective Adaptation with Lightweight Modification}), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6\% of generated tokens, yet yield high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop \textbf{STORM} (\textit{Smart Thinking Optimization Reasoning Model}), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9\% across five popular optimization modeling benchmarks, matching the performance of a 671B-parameter LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path toward expert-level performance on challenging optimization modeling tasks.