Program synthesis aims to generate accurate, executable programs from problem specifications, in our setting from natural language descriptions. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. These applications of RL focus on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Although policy-based RL methods dominate the literature on RL for program synthesis, the nature of program synthesis tasks suggests a natural alignment with value-based methods. This stems from the abundance of off-policy programs, including those written by human programmers as well as historical samples, together with the ease of verifying generated programs through automated unit testing, which makes rewards straightforward to obtain. Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of $\mathcal{B}$-Coder (pronounced Bellman coder). Yet training value-based methods is challenging due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents that utilizes pre-trained LMs, together with a conservative Bellman operator, to reduce training complexity. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrate that $\mathcal{B}$-Coder achieves state-of-the-art performance compared with policy-based methods. Remarkably, this is achieved with minimal reward engineering effort, highlighting the effectiveness of value-based RL independent of reward design.
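To make the value-based post-processing idea concrete, the sketch below re-ranks sampled candidate programs by a learned value estimate and keeps the highest-scoring one. This is a minimal illustration of the general idea rather than $\mathcal{B}$-Coder's actual procedure; the callables `sample_programs` and `q_value` are hypothetical placeholders standing in for an LM sampler and a learned value function (an estimate of expected functional correctness, e.g., the probability of passing unit tests).

```python
from typing import Callable, List, Tuple


def rerank_by_value(
    prompt: str,
    sample_programs: Callable[[str, int], List[str]],  # hypothetical LM sampler
    q_value: Callable[[str, str], float],               # hypothetical learned value estimate
    num_samples: int = 16,
) -> Tuple[str, float]:
    """Sample candidate programs for `prompt`, score each with a learned value
    function, and return the candidate with the highest estimated value."""
    candidates = sample_programs(prompt, num_samples)
    scored = [(q_value(prompt, program), program) for program in candidates]
    best_score, best_program = max(scored, key=lambda pair: pair[0])
    return best_program, best_score
```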