Machine learning models are widely used but are also often wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that they can make a rational decision about whether to use it. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with the likelihood of correctness, the model is said to be well-calibrated. In that case, high-confidence outputs could be safely accepted and low-confidence outputs rejected. Calibration has so far been studied mostly in non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code is quite often wrong: developers need to know when they should, for example, use model-generated code directly, use it after careful review, or discard it. Calibration is therefore vital in generative settings. The notion of correctness of generated code is non-trivial, however, and thus so is calibration. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, generative code models are not well-calibrated out of the box. We then show how calibration can be improved using standard methods such as Platt scaling. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in Software Engineering.
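To make the two technical notions in the abstract concrete, the sketch below measures miscalibration with an expected calibration error (ECE) over equal-width bins and then rescales confidences with a simplified Platt scaling (a one-dimensional logistic regression on the logit of the raw confidence, fitted by plain gradient descent). This is an illustration under stated assumptions, not the paper's framework: the function names, the binning scheme, the optimizer settings, and the toy data are all hypothetical, and Platt's original label smoothing is omitted.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (one common formulation)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-6):
    # Clip to avoid infinities at exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_platt(scores, correct, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Simplified: no label smoothing; lr and steps are illustrative choices.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            err = _sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    return [_sigmoid(a * s + b) for s in scores]


if __name__ == "__main__":
    # Made-up toy data (NOT results from the paper): an overconfident model.
    raw_conf = [0.95] * 10 + [0.7] * 10
    correct = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7

    scores = [logit(c) for c in raw_conf]
    a, b = fit_platt(scores, correct)
    rescaled = apply_platt(scores, a, b)

    print("ECE before:", expected_calibration_error(raw_conf, correct))
    print("ECE after: ", expected_calibration_error(rescaled, correct))
```

In this toy run the parameters are fitted and evaluated on the same samples, which flatters the result; a realistic evaluation would fit `a` and `b` on a held-out split. For generated code, the `correct` labels would come from whichever correctness criterion is chosen (for instance, passing tests or matching a reference), which is exactly where the abstract notes that the notion becomes non-trivial.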