The advent of pre-trained code language models (CodeLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine CodeLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot adequately capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the CodeLM is correct or not based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the CodeLM generation probability and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA, and basic Python programming, LEVER consistently improves over the base CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
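The reranking step described above can be illustrated with a minimal sketch. It assumes that the verification score and the generation probability are combined as a product, and that each sampled program has already been executed; the `Sample` fields and the `rerank` function are hypothetical names introduced here for illustration, not taken from the LEVER codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Hashable, List


@dataclass
class Sample:
    """One program sampled from the CodeLM (hypothetical container)."""
    program: str
    exec_result: Hashable   # execution result, used as the grouping key
    lm_prob: float          # CodeLM generation probability of the program
    verifier_prob: float    # verifier's probability that the program is correct


def rerank(samples: List[Sample]) -> Sample:
    """Return a program whose execution result has the highest aggregated score.

    Assumption: the per-program score is lm_prob * verifier_prob, and scores
    are summed over all programs that share the same execution result.
    """
    totals = defaultdict(float)
    for s in samples:
        totals[s.exec_result] += s.lm_prob * s.verifier_prob

    # Pick the execution result with the largest marginalized score.
    best_result = max(totals, key=totals.get)

    # Among programs producing that result, return the highest-scoring one.
    candidates = [s for s in samples if s.exec_result == best_result]
    return max(candidates, key=lambda s: s.lm_prob * s.verifier_prob)


if __name__ == "__main__":
    samples = [
        Sample("a", exec_result=42, lm_prob=0.5, verifier_prob=0.3),
        Sample("b", exec_result=7, lm_prob=0.2, verifier_prob=0.5),
        Sample("c", exec_result=7, lm_prob=0.2, verifier_prob=0.4),
    ]
    print(rerank(samples).program)
```

In this toy input, program `a` has the highest individual score, but the two programs that agree on the result `7` jointly outweigh it, so marginalization changes the selected answer.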