Machine learning models are widely used, but they can often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that they can make a rational decision about whether to use it. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with the likelihood of correctness, the model is said to be well-calibrated. A well-calibrated confidence measure can serve as a basis for rational, graduated decision-making about how much review and care is needed when using generated code. So far, calibration has been studied mostly in non-generative (e.g., classification) settings, especially in software engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to use it directly, use it after review of varying intensity, or discard it. Calibration is therefore vital in generative settings. We make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, the generative code models we test are not well-calibrated out of the box. We then show how calibration can be improved using standard methods, such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate its applicability and generalizability in software engineering, and discuss settings where it has good potential for practical use and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in software engineering.
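To make the calibration step concrete, the following is a minimal, self-contained sketch of Platt scaling, not the paper's actual implementation. It assumes access to a held-out set of raw confidence scores paired with binary correctness labels (the "prior availability of correctness data" the abstract mentions), and fits the standard sigmoid map p = 1/(1 + exp(-(a*s + b))) by gradient descent on the log loss; all names are illustrative.

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit Platt scaling parameters (a, b) on held-out data.

    scores: raw model confidences; labels: 1 if the output was correct, else 0.
    Minimizes log loss of sigmoid(a*s + b) via full-batch gradient descent.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(log loss)/da for one example
            grad_b += (p - y)      # d(log loss)/db for one example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated(score, a, b):
    """Map a raw confidence score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Illustration: an overconfident model reports 0.9 confidence on outputs
# that are correct only half the time; Platt scaling pulls the
# calibrated probability toward the empirical accuracy (~0.5).
raw_scores = [0.9] * 10
correct = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
a, b = platt_scale(raw_scores, correct)
print(calibrated(0.9, a, b))  # close to 0.5 after fitting
```

The fitted sigmoid leaves the ranking of outputs unchanged; it only re-maps scores so that a reported confidence of, say, 0.7 corresponds to roughly 70% observed correctness, which is what makes graduated review decisions possible.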