Code summarization facilitates program comprehension and software maintenance by converting code snippets into natural-language descriptions. Over the years, numerous methods have been developed for this task, but a key challenge remains: effectively evaluating the quality of generated summaries. While human evaluation is effective for assessing code summary quality, it is labor-intensive and difficult to scale. Commonly used automatic metrics, such as BLEU, ROUGE-L, METEOR, and BERTScore, often fail to align closely with human judgments. In this paper, we explore the potential of Large Language Models (LLMs) for evaluating code summarization. We propose CODERPE (Role-Player for Code Summarization Evaluation), a novel method that leverages role-player prompting to assess the quality of generated summaries. Specifically, we prompt an LLM agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. Each role evaluates the quality of code summaries across key dimensions, including coherence, consistency, fluency, and relevance. We further explore the robustness of LLMs as evaluators by employing various prompting strategies, including chain-of-thought reasoning, in-context learning, and tailored rating form designs. The results demonstrate that LLMs serve as effective evaluators for code summarization methods. Notably, our LLM-based evaluator, CODERPE, achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
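The role-player evaluation scheme described above can be sketched as prompt construction over role × dimension pairs. This is a minimal illustrative sketch, not the paper's actual prompts: the roles and dimensions come from the abstract, while the prompt wording and function names are assumptions.

```python
# Hypothetical sketch of role-player prompting for code summary evaluation,
# loosely following the CODERPE idea. Prompt wording is illustrative only;
# the roles and evaluation dimensions are those named in the abstract.

ROLES = ["code reviewer", "code author", "code editor", "system analyst"]
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def build_prompt(role: str, dimension: str, code: str, summary: str) -> str:
    """Compose one role-play evaluation prompt for a (role, dimension) pair."""
    return (
        f"You are a {role}. Rate the {dimension} of the summary below "
        f"for the given code on a scale of 1-5.\n\n"
        f"Code:\n{code}\n\nSummary:\n{summary}\n\nScore (1-5):"
    )

def all_prompts(code: str, summary: str) -> list[str]:
    """One prompt per role x dimension pair (4 x 4 = 16 here). Each prompt
    would be sent to an LLM, and the returned scores aggregated (e.g.,
    averaged) into a final quality rating for the summary."""
    return [
        build_prompt(role, dim, code, summary)
        for role in ROLES
        for dim in DIMENSIONS
    ]
```

Each prompt is answered independently by the LLM agent, so the aggregation step (averaging, weighting by role, etc.) is a separate design choice.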