Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories that serve as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial-trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 points on LiveCodeBench. We also analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.