Machine learning models are widely used, but their outputs are often wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that they can make a rational decision about whether to use it. For example, each output can be associated with a confidence measure; if this measure is strongly associated with the likelihood of correctness, the model is said to be well-calibrated. In that case, high-confidence outputs could be safely accepted and low-confidence outputs rejected. Calibration has so far been studied mostly in non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: developers need to know when they should directly use, use after careful review, or discard model-generated code. Calibration is therefore vital in generative settings. The notion of correctness of generated code is non-trivial, however, and thus so is calibration. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, generative code models are not well-calibrated out of the box. We then show how calibration can be improved using standard methods such as Platt scaling. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in Software Engineering.
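To make the notions of calibration and Platt scaling concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each generated sample comes with a scalar confidence score and a binary correctness label, that miscalibration is summarized by the expected calibration error (ECE) over equal-width bins (the abstract does not say which metric is used), and that Platt scaling is the plain two-parameter logistic fit without the target smoothing of Platt's original formulation. All function names and the toy data are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, correct, lr=1.0, steps=5000):
    """Fit p = sigmoid(a * s + b) by batch gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to rescaled confidences with the fitted parameters."""
    return [_sigmoid(a * s + b) for s in scores]


# Toy data (made up): an overconfident model.
# Scores of 0.9 are right half the time; scores of 0.6 a quarter of the time.
scores = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6]
correct = [1, 1, 0, 0, 1, 0, 0, 0]

ece_before = expected_calibration_error(scores, correct)
a, b = fit_platt(scores, correct)
ece_after = expected_calibration_error(apply_platt(scores, a, b), correct)
print(f"ECE before: {ece_before:.3f}, after Platt scaling: {ece_after:.3f}")
```

In this sketch the scaling parameters are fitted and evaluated on the same toy samples only to keep the example short; in practice they would be fitted on a held-out set, and the rescaled confidence would then drive the accept, review, or discard decision described above.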