Large Language Models (LLMs) pre-trained on code have recently emerged as the dominant approach to program synthesis. However, these models are trained with next-token prediction, which ignores the syntax and semantics of code. We propose RLCF, which further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of generated code. The grounding function uses (i) compiler-derived feedback on whether the generated code passes a set of correctness checks, and (ii) feedback from a different LLM that compares the generated code to reference code. RLCF is model- and language-agnostic. We empirically evaluate it on the MBJP and MathQA tasks for Java. Our experiments show that RLCF raises the odds that an LLM-generated program compiles, is executable, and produces the correct output on tests, often allowing LLMs to match the performance of 2x-8x larger LLMs.
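To make the grounding function concrete, the sketch below shows one way such a reward might be assembled: a binary compile signal from `javac` combined with a comparator score from a second model. This is a minimal illustration, not the paper's implementation; the weights `w_compile` and `w_compare`, the `compare_to_reference` callable, and the assumption that the snippet's public class is named `Main` are all hypothetical placeholders.

```python
import pathlib
import subprocess
import tempfile


def grounding_score(java_source: str,
                    reference: str,
                    compare_to_reference,  # hypothetical callable: (generated, reference) -> float in [0, 1]
                    w_compile: float = 0.5,  # illustrative weights, not from the paper
                    w_compare: float = 0.5) -> float:
    """Score a generated Java snippet by combining (i) compiler feedback
    and (ii) an LLM comparator against a reference solution."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumes the snippet declares a public class named Main and that
        # javac is on PATH; a fuller check would also run unit tests.
        src = pathlib.Path(tmp) / "Main.java"
        src.write_text(java_source)
        result = subprocess.run(["javac", str(src)], capture_output=True)
        compiles = 1.0 if result.returncode == 0 else 0.0
    # Second signal: a different LLM judges how well the generated code
    # matches the reference code.
    comparison = compare_to_reference(java_source, reference)
    return w_compile * compiles + w_compare * comparison
```

In an RL fine-tuning loop, a score like this would serve as the reward for each sampled completion, steering the policy LLM toward code that both compiles and agrees with reference solutions.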