Machine learning models are widely used, but they can often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that they can make a rational decision about whether to use it. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with the likelihood of correctness, the model is said to be well-calibrated. A well-calibrated confidence measure can serve as a basis for rational, graduated decision-making about how much review and care is needed when using generated code. So far, calibration has been studied mostly in non-generative (e.g., classification) settings, especially in software engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to use it directly, use it after review of varying intensity, or discard it. Calibration is therefore vital in generative settings. We make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, the generative code models we test are not well-calibrated out of the box. We then show how calibration can be improved using standard methods, such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate its applicability and generalizability in software engineering, and discuss settings where it has good potential for practical use and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in software engineering.
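To make the calibration step concrete, the following is a minimal, self-contained sketch of Platt scaling, not the paper's actual implementation. It assumes access to a held-out set of raw confidence scores paired with binary correctness labels (the "prior availability of correctness data" the abstract mentions), and fits the standard sigmoid map p = 1/(1 + exp(-(a*s + b))) by gradient descent on the log loss; all names are illustrative.

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit Platt scaling parameters (a, b) on held-out data.

    scores: raw model confidences; labels: 1 if the output was correct, else 0.
    Minimizes log loss of sigmoid(a*s + b) via full-batch gradient descent.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(log loss)/da for one example
            grad_b += (p - y)      # d(log loss)/db for one example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated(score, a, b):
    """Map a raw confidence score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Illustration: an overconfident model reports 0.9 confidence on outputs
# that are correct only half the time; Platt scaling pulls the
# calibrated probability toward the empirical accuracy (~0.5).
raw_scores = [0.9] * 10
correct = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
a, b = platt_scale(raw_scores, correct)
print(calibrated(0.9, a, b))  # close to 0.5 after fitting
```

The fitted sigmoid leaves the ranking of outputs unchanged; it only re-maps scores so that a reported confidence of, say, 0.7 corresponds to roughly 70% observed correctness, which is what makes graduated review decisions possible.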