We develop an approach to estimate the probability that a program sampled from a large language model is correct. Given a natural language description of a programming problem, our method samples both candidate programs and candidate predicates specifying how the program should behave. This makes it possible to learn a model that produces well-calibrated probabilistic predictions of program correctness. Our system also infers which predicates are useful for explaining the behavior of the generated code; in a human study, participants preferred these predicates over raw language model outputs. Our method is simple, easy to implement, and maintains state-of-the-art generation accuracy.
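The abstract does not spell out the model, so the following is only a minimal sketch of the general idea, under stated assumptions: candidate programs and candidate predicates are represented as plain Python callables, the only feature is the fraction of predicates a program satisfies, and that feature is mapped to a probability with a logistic function. The names `pass_fraction` and `correctness_probability`, and the `weight` and `bias` values, are hypothetical; in a real system the parameters would be fit on labeled data so that the predictions are calibrated.

```python
import math
from typing import Callable, Sequence

# Assumed representations (not from the paper): a program maps an int to an int,
# and a predicate inspects a program and returns True if it behaves as expected.
Program = Callable[[int], int]
Predicate = Callable[[Program], bool]


def pass_fraction(program: Program, predicates: Sequence[Predicate]) -> float:
    """Fraction of sampled predicates the program satisfies.

    A predicate that raises an exception is counted as a failure.
    """
    if not predicates:
        return 0.0
    passed = 0
    for predicate in predicates:
        try:
            if predicate(program):
                passed += 1
        except Exception:
            pass  # treat a crash as an unsatisfied predicate
    return passed / len(predicates)


def correctness_probability(
    program: Program,
    predicates: Sequence[Predicate],
    weight: float = 4.0,   # hypothetical value; would be learned from labeled data
    bias: float = -2.0,    # hypothetical value; would be learned from labeled data
) -> float:
    """Map the pass fraction to a probability with a logistic function."""
    z = weight * pass_fraction(program, predicates) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy problem: "return the square of x".
candidates = [lambda x: x * x, lambda x: x + x]
predicates = [
    lambda f: f(2) == 4,
    lambda f: f(3) == 9,
    lambda f: f(0) == 0,
]
scores = [correctness_probability(p, predicates) for p in candidates]
```

In this toy setting the first candidate satisfies all three predicates and the second satisfies two of them, so the first receives the higher score. The predicates a chosen program satisfies could also be shown to a user as a rough explanation of its behavior, in the spirit of the predicate-based explanations described above.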