Optimization modeling is inherently hierarchical: it requires a precise sequence of symbolic commitments. Traditional learning-based methods for automated optimization modeling improve modeling policies through large-scale annotated or curated training data, but they are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, so repeated rollouts inherit similar modeling biases and provide limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples Monte Carlo Tree Search (MCTS) with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via Group Relative Policy Optimization (GRPO) at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B-parameter backbone, outperforming existing methods and frontier LLMs.
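To make the idea of using sibling nodes as a local comparison set concrete, the following is a minimal sketch and not the StarOR implementation. It assumes the standard GRPO convention of standardizing each reward against its group's mean and standard deviation, and it treats the children expanded from one MCTS node as that group. The function name `sibling_advantages`, the `eps` constant, and the reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def sibling_advantages(rewards, eps=1e-8):
    """Standardize each sibling's reward against its own group.

    `rewards` holds one scalar per sibling candidate expanded from the
    same parent node. The result is positive for candidates that beat
    the group average and negative for those below it.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sibling group
    # eps (an assumed constant) avoids division by zero when all rewards tie
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sibling formulations of one modeling stage.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = sibling_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Because the advantages are relative to the siblings alone, a candidate is rewarded only for being better than the alternatives at the same decision point, which is one way to obtain credit assignment for intermediate decisions without ground-truth labels. When all siblings tie, every advantage is zero and the group contributes no update signal.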