Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. In most cases, however, researchers rely on automatic evaluation metrics such as BLEU, ROUGE, and METEOR. These metrics all rest on the same assumption: the higher the textual similarity between the generated summary and a reference summary written by developers, the higher its quality. This assumption falls short for two reasons: (i) reference summaries, e.g., code comments collected by mining software repositories, may be of low quality or even outdated; (ii) a generated summary that is worded differently from the reference may still be semantically equivalent to it, and thus still suitable to document the code snippet. In this paper, we perform a thorough empirical investigation of the complementarity of different types of metrics in capturing the quality of a generated summary. We also propose to address the limitations of existing metrics by considering a new dimension: the extent to which the generated summary aligns with the semantics of the documented code snippet, independently of the reference summary. To this end, we present a new metric based on contrastive learning that captures this aspect. We empirically show that including this dimension yields a more effective representation of developers' evaluations of the quality of automatically generated summaries.
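Limitation (ii) can be illustrated with a minimal sketch. The code below is not BLEU, ROUGE, or METEOR; it uses a simplified unigram-overlap F1 as a stand-in for reference-based textual similarity, and the reference comment and the two generated summaries are hypothetical examples, not data from the study.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two summaries.

    Assumption: a simplified stand-in for reference-based metrics,
    not an implementation of BLEU, ROUGE, or METEOR.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both summaries.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference comment and two hypothetical generated summaries.
reference = "returns the largest element in the list"
near_copy = "returns the largest element in the array"
paraphrase = "finds the maximum value of a collection"

near_copy_score = unigram_f1(near_copy, reference)
paraphrase_score = unigram_f1(paraphrase, reference)

print(f"near copy:  {near_copy_score:.2f}")
print(f"paraphrase: {paraphrase_score:.2f}")
```

Under these assumptions the paraphrase shares only one token with the reference and scores far lower than the near copy, even though both describe the same behavior. This is the kind of gap that a dimension independent of the reference summary is meant to cover.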