Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

A good summary can be very useful during program comprehension. A brief, fluent, and relevant summary is helpful, but it requires significant human effort to produce. Good summaries are often unavailable in software projects, which makes maintenance more difficult. There is a considerable body of research on automated, AI-based methods that use large language models (LLMs) to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies. However, LLM-produced summaries can be too long, irrelevant, or otherwise too dissimilar to what a human might say. Given an LLM-produced code summary, how can we judge whether it is good enough? Given some input source code and an LLM-generated summary, existing approaches can help judge brevity, fluency, and relevance; however, it is difficult to gauge whether an LLM-produced summary sufficiently resembles what a human might produce without a "golden" human-produced summary to compare against. We study this resemblance question as a calibration problem: given just the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches to provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
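To make the calibration framing concrete, the following is a minimal sketch of how a confidence measure could be checked against outcomes. It rests on assumptions that are not stated in the abstract: that "sufficiently resembles" is turned into a binary outcome by thresholding a similarity score (the threshold value and the helper name `label_outcomes` are hypothetical), and that calibration is assessed with two common measures, the Brier score and the expected calibration error (ECE). The paper's actual metrics and thresholds may differ.

```python
from typing import List, Sequence


def label_outcomes(similarities: Sequence[float], threshold: float) -> List[int]:
    """Hypothetical helper: 1 if a summary's similarity to the human
    reference meets the (assumed) threshold, else 0."""
    return [1 if s >= threshold else 0 for s in similarities]


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    n = len(confidences)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / n


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Group predictions into equal-width confidence bins, then take the
    weighted average gap between mean confidence and observed success rate."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, o))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - success_rate)
    return ece


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    similarities = [0.92, 0.88, 0.41, 0.35]
    confidences = [0.9, 0.9, 0.1, 0.1]
    outcomes = label_outcomes(similarities, threshold=0.8)
    print("Brier:", brier_score(confidences, outcomes))
    print("ECE:", expected_calibration_error(confidences, outcomes))
```

In this sketch, a well-calibrated confidence measure is one where, among summaries assigned a confidence near p, roughly a fraction p turn out to meet the similarity threshold; lower Brier and ECE values indicate a closer match.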
