To support software developers in understanding and maintaining programs, various automatic code summarization techniques have been proposed to generate a concise natural language comment for a given code snippet. Recently, the emergence of large language models (LLMs) has greatly improved performance on natural language processing tasks. Among them, ChatGPT is the most popular and has attracted wide attention from the software engineering community. However, it remains unclear how ChatGPT performs in (automatic) code summarization. Therefore, in this paper, we evaluate ChatGPT on a widely used Python dataset called CSN-Python and compare it with several state-of-the-art (SOTA) code summarization models. Specifically, we first explore an appropriate prompt to guide ChatGPT to generate in-distribution comments. Then, we use this prompt to ask ChatGPT to generate comments for all code snippets in the CSN-Python test set. We adopt three widely used metrics (BLEU, METEOR, and ROUGE-L) to measure the quality of the comments generated by ChatGPT and by the SOTA models (NCS, CodeBERT, and CodeT5). The experimental results show that, in terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than that of all three SOTA models. We also present some cases and discuss the advantages and disadvantages of ChatGPT in code summarization. Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
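To make the evaluation setup more concrete, the following is a minimal sketch of how one of the three metrics, ROUGE-L, can be computed for a single generated comment against its reference comment. It is an illustration only, not the evaluation script used in this paper: the function names `lcs_length` and `rouge_l`, the lowercase whitespace tokenization, and the default `beta=1.0` (which yields a plain F1 score) are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score based on the LCS of the two comments.

    Assumption: tokens are lowercase whitespace-separated words; real
    evaluation scripts may tokenize and weight recall differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    generated = "return the sum of two numbers"
    reference = "returns the sum of two integers"
    print(f"ROUGE-L: {rouge_l(generated, reference):.3f}")
```

In this hypothetical pair, four of the six tokens on each side form the longest common subsequence ("the sum of two"), so precision and recall are both 4/6. A corpus-level score would typically average such per-sample values over the test set.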