Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.