Code review is essential for ensuring the quality and maintainability of software projects. However, it is time-consuming and often error-prone, and it can significantly affect the development process. ChatGPT, a recent large language model, has demonstrated impressive performance on a variety of natural language processing tasks, which suggests that it could help automate code review. How well ChatGPT actually performs on code review tasks, however, remains unclear. To fill this gap, we conduct the first empirical study of ChatGPT's capabilities in code review, focusing on automated code refinement based on given code reviews. For the study, we select the existing benchmark CodeReview and construct a new, high-quality code review dataset. We use CodeReviewer, a state-of-the-art code review tool, as the baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer on code refinement: on the high-quality code review dataset, ChatGPT achieves exact match (EM) and BLEU scores of 22.78 and 76.44, respectively, whereas the state-of-the-art baseline achieves only 15.50 and 62.88. We further identify the root causes of the cases in which ChatGPT underperforms and propose several strategies to mitigate them. Our study provides insights into the potential of ChatGPT for automating the code review process and highlights promising research directions.
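To make the two reported metrics concrete, the following is a minimal sketch of how EM and a sentence-level BLEU score could be computed for a refined code snippet against its ground-truth revision. It is an illustration under stated assumptions, not the evaluation script used in the study: the whitespace tokenization, the unsmoothed 4-gram BLEU, and all function names (`tokenize`, `exact_match`, `bleu`, `em_rate`) are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(code):
    """Split code on whitespace (an assumed, deliberately simple tokenization)."""
    return code.split()


def exact_match(prediction, reference):
    """True when prediction and reference yield identical token sequences."""
    return tokenize(prediction) == tokenize(reference)


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(prediction, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0-100 scale, single reference.

    Returns 0.0 if any n-gram order has no overlap, as unsmoothed BLEU does.
    """
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(pred_counts.values())
        # Clip each predicted n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize predictions shorter than the reference.
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


def em_rate(pairs):
    """Percentage of (prediction, reference) pairs that match exactly."""
    return 100.0 * sum(exact_match(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical refinement outputs paired with their ground-truth revisions.
    pairs = [
        ("return a + b ;", "return  a + b ;"),
        ("int total = count + 1 ;", "int total = count + 2 ;"),
    ]
    print("EM rate:", em_rate(pairs))
    for prediction, reference in pairs:
        print("BLEU:", bleu(prediction, reference))
```

EM is strict: a single differing token makes a pair count as a miss. BLEU rewards partial n-gram overlap, which is consistent with the BLEU scores in the abstract being much higher than the EM scores for both systems.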