Can an LLM learn how an optimizer behaves -- and use that knowledge to control it? We extend Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of $(1{+}1)$-$\text{RLS}_k$, the LLM synthesizes a simulator of the optimizer's dynamics; greedy planning over this simulator then selects the mutation strength $k$ at each step. On \lo{} and \onemax{}, CWM-greedy performs within 6\% of the theoretically optimal policy -- without ever seeing optimal-policy trajectories. On \jump{$_k$}, where a deceptive fitness valley causes all adaptive baselines to fail (0\% success rate), CWM-greedy achieves a 100\% success rate -- even though no data-collection policy had oracle knowledge of the gap parameter. On NK-Landscapes, where no closed-form model exists, CWM-greedy outperforms all baselines across 15 independently generated instances ($36.94$ vs.\ $36.32$; $p<0.001$) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs.\ 500 online episodes), success rate (100\% vs.\ 58\%), and generalization ($k{=}3$: 78\% vs.\ 0\%). Robustness experiments confirm stable synthesis across 5 independent runs.
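The control loop described above (simulate the optimizer's dynamics, then plan greedily over candidate mutation strengths) can be sketched as follows. This is a minimal illustration, not the LLM-synthesized program itself: the simulator is a hand-written stand-in for a CWM of $(1{+}1)$-$\text{RLS}_k$ on \onemax{}, exploiting the symmetry that makes the fitness value a sufficient state; the function names and rollout count are illustrative choices.

```python
import random

def simulate_step(n, fitness, k, rng):
    """Hand-written stand-in for an LLM-synthesized CWM: one step of
    (1+1)-RLS_k on OneMax with n bits. By symmetry, assume the ones
    occupy the first `fitness` positions; flip k distinct positions."""
    flips = rng.sample(range(n), k)
    j = sum(1 for p in flips if p < fitness)   # flipped ones -> zeros
    new_fitness = fitness - j + (k - j)        # net change: k - 2j
    return max(fitness, new_fitness)           # elitist acceptance

def greedy_k(n, fitness, ks=(1, 2, 3), rollouts=200, seed=0):
    """Greedy planning over the simulator: estimate the one-step
    expected fitness for each candidate k via Monte Carlo rollouts
    and pick the best."""
    rng = random.Random(seed)
    def expected(k):
        return sum(simulate_step(n, fitness, k, rng)
                   for _ in range(rollouts)) / rollouts
    return max(ks, key=expected)
```

In the paper's setting the simulator body is synthesized by the LLM from suboptimal trajectories rather than written by hand, but the planner's structure is the same: it never needs optimal-policy data, only a model accurate enough to rank the candidate values of $k$.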