A good summary can be very useful during program comprehension. While a brief, fluent, and relevant summary is helpful, producing one requires significant human effort. Good summaries are often unavailable in software projects, making maintenance more difficult. There has been a considerable body of research into automated, AI-based methods that use large language models (LLMs) to generate summaries of code; there has also been quite a bit of work on measuring the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been proposed and evaluated in human-subject studies. However, LLMs often err and generate something quite unlike what a human might say. Given an LLM-produced code summary, is there a way to gauge whether it is likely to be sufficiently similar to a human-produced summary? In this paper, we study this question as a calibration problem: given a summary from an LLM, can we compute a confidence measure that reliably indicates whether the summary is sufficiently similar to what a human would have produced in the same situation? We examine this question for several LLMs, several programming languages, and several different settings. We propose an approach that provides well-calibrated predictions of the likelihood of similarity to human summaries.
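To make the calibration framing concrete, the sketch below (an illustration, not the paper's method) computes the standard expected calibration error (ECE) of a set of confidence scores against binary labels marking whether each generated summary was judged "sufficiently similar" to a human reference, e.g. by thresholding BERTScore. The function name, example data, and the choice of labeling metric are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, is_similar, n_bins=10):
    """Expected calibration error (ECE) over equal-width confidence bins.

    confidences: per-summary confidence that the generated summary is
                 "sufficiently similar" to a human-written one.
    is_similar:  binary labels, e.g. 1 if a similarity metric such as
                 BERTScore against the human reference exceeds a threshold.
    """
    confidences = np.asarray(confidences, dtype=float)
    is_similar = np.asarray(is_similar, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # mean predicted confidence in the bin
        frac_sim = is_similar[mask].mean()    # observed rate of "similar" summaries
        ece += mask.mean() * abs(avg_conf - frac_sim)
    return ece

# Illustrative data only: confidences from some scorer, labels from a
# thresholded similarity metric against human-written references.
conf = [0.92, 0.81, 0.30, 0.65, 0.95, 0.40]
labels = [1, 1, 0, 1, 1, 0]
print(f"ECE = {expected_calibration_error(conf, labels):.3f}")
```

A well-calibrated confidence measure yields a low ECE: within each bin, the average reported confidence matches the observed fraction of summaries that are actually similar to human-written ones.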