The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking, using either test cases or heuristics based on the execution results. However, test cases are difficult to obtain for many real-world language-to-code applications, and heuristics cannot adequately capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct, based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA, and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
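The reranking step described above can be illustrated with a minimal sketch. It assumes that the verification score and the LLM generation probability are combined by multiplication, and that marginalization means summing these joint scores over all sampled programs sharing the same execution result; the abstract does not spell out either detail. The names `Candidate` and `rerank` and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from typing import Hashable, Iterable, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One sampled program (hypothetical container, not from the paper)."""
    program: str
    execution_result: Hashable  # must be hashable so results can be grouped
    lm_logprob: float           # log-probability assigned by the code LLM
    verifier_prob: float        # verifier's probability that the program is correct


def rerank(candidates: Iterable[Candidate]) -> Tuple[str, Hashable, float]:
    """Return (program, execution_result, aggregated_score) for the best result.

    Assumption: joint score = LLM probability * verifier probability,
    summed over all programs that produce the same execution result.
    """
    aggregated = defaultdict(float)  # execution result -> summed joint score
    best_program = {}                # execution result -> (joint score, program)
    for cand in candidates:
        joint = math.exp(cand.lm_logprob) * cand.verifier_prob
        aggregated[cand.execution_result] += joint
        current = best_program.get(cand.execution_result)
        if current is None or joint > current[0]:
            best_program[cand.execution_result] = (joint, cand.program)
    if not aggregated:
        raise ValueError("no candidates to rerank")
    top_result = max(aggregated, key=aggregated.get)
    return best_program[top_result][1], top_result, aggregated[top_result]


# Illustrative values only: two programs agree on the result 5 and together
# outscore the single most probable program, which returns 4.
samples = [
    Candidate("a", 4, math.log(0.5), 0.3),
    Candidate("b", 5, math.log(0.3), 0.4),
    Candidate("c", 5, math.log(0.2), 0.5),
]
program, result, score = rerank(samples)
```

In this toy input, program `a` has the highest individual joint score, yet the result shared by `b` and `c` wins once their scores are summed, which is the effect that marginalizing over execution results is meant to capture.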