The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking, using either test cases or heuristics based on the execution results. However, test cases are hard to obtain for many real-world language-to-code applications, and heuristics cannot adequately capture the semantic features of the execution results, such as data type and value range, which often indicate whether the program is correct. In this work, we propose LEVER, a simple approach that improves language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct, based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA, and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
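The reranking step described above can be illustrated with a minimal sketch. It assumes that the verification score and the LLM generation probability are combined by multiplication and that marginalization means summing these joint scores over all sampled programs that share an execution result; the abstract does not spell out either detail. The `Candidate` class, its field names, and the `rerank` function are hypothetical names introduced only for this illustration, and the numbers are made up.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sampled program (hypothetical container for this sketch)."""
    program: str          # program text sampled from the code LLM
    exec_result: str      # execution result, serialized to a string
    lm_logprob: float     # log-probability assigned by the code LLM
    verifier_prob: float  # verifier's probability that the program is correct


def rerank(candidates):
    """Pick a program whose execution result has the highest aggregated score.

    Assumption: joint score = P_LM(program) * P_verifier(correct), summed over
    all candidates that produce the same execution result.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    # Joint score of each individual program.
    joint = [math.exp(c.lm_logprob) * c.verifier_prob for c in candidates]

    # Marginalize: sum joint scores of programs with identical execution results.
    totals = defaultdict(float)
    for cand, score in zip(candidates, joint):
        totals[cand.exec_result] += score

    best_result = max(totals, key=totals.get)

    # Among programs with the winning result, return the one with the top joint score.
    winners = [(c, s) for c, s in zip(candidates, joint) if c.exec_result == best_result]
    best = max(winners, key=lambda pair: pair[1])[0]
    return best, dict(totals)


# Illustrative usage with made-up numbers.
samples = [
    Candidate("answer = 6 * 7", "42", math.log(0.5), 0.3),
    Candidate("answer = 3 + 4", "7", math.log(0.2), 0.5),
    Candidate("answer = sum([3, 4])", "7", math.log(0.2), 0.5),
]
best, totals = rerank(samples)
print(best.program, totals)
```

In this toy input, the single program returning `42` has the highest individual joint score, yet the two programs returning `7` outscore it once their scores are summed. That is the effect marginalizing over execution results is meant to capture: semantically equivalent samples pool their evidence.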