Studying Quality Improvements Recommended via Manual and Automated Code Review

Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, the extent to which these approaches can recommend quality improvements as a human reviewer would remains unclear. We study the similarities and differences between code reviews performed by humans and those automatically generated by DL models, using ChatGPT-4 as a representative of the latter. In particular, we run a mining-based study in which we collect and manually inspect 739 comments posted by human reviewers to suggest code changes in 240 PRs. The manual inspection aims at classifying the type of quality improvement recommended by human reviewers (e.g., rename variable/constant). Then, we ask ChatGPT to perform a code review on the same PRs, and we compare the quality improvements it recommends against those suggested by the human reviewers. We show that while, on average, ChatGPT tends to recommend a higher number of code changes than human reviewers (~2.4x more), it spots only 10% of the quality issues reported by humans. However, ~40% of the additional comments generated by the LLM point to meaningful quality issues. In short, our findings show the complementarity of manual and AI-based code review. This suggests that, in its current state, DL-based code review can be used as a further quality check on top of the one performed by humans. It should not be considered a valid alternative to human review, nor a means to save code review time, since human reviewers would still need to perform their manual inspection while also validating the quality issues reported by the DL-based technique.
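To make the two headline measures concrete, the following is a minimal sketch of how a comment-volume ratio and the share of human-reported issues also raised by the LLM could be computed once comments have been labeled. It is not the study's procedure: the study relies on manual inspection, whereas this sketch assumes, purely for illustration, that two comments match when they share the same PR, code location, and category. The `Comment` type, the `compare_reviews` function, the file locations, and every category label other than "rename variable/constant" are hypothetical, and the data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    pr_id: int      # pull request the comment belongs to
    category: str   # manually assigned quality-improvement label
    location: str   # code element the comment targets (hypothetical format)


def compare_reviews(human, llm):
    """Return (LLM/human comment ratio, share of human issues the LLM also raised)."""
    # Assumption: a match means same PR, same location, and same category.
    llm_keys = {(c.pr_id, c.location, c.category) for c in llm}
    matched = sum((c.pr_id, c.location, c.category) in llm_keys for c in human)
    return len(llm) / len(human), matched / len(human)


# Toy, made-up data (NOT taken from the study).
human = [
    Comment(1, "rename variable/constant", "Foo.java:12"),
    Comment(1, "extract method", "Foo.java:40"),
]
llm = [
    Comment(1, "rename variable/constant", "Foo.java:12"),
    Comment(1, "add null check", "Foo.java:18"),
    Comment(1, "improve comment", "Foo.java:3"),
    Comment(1, "simplify condition", "Foo.java:55"),
]

ratio, overlap = compare_reviews(human, llm)
print(f"LLM/human comment ratio: {ratio:.1f}")
print(f"Human issues also spotted by the LLM: {overlap:.0%}")
```

On this toy input the LLM posts twice as many comments as the human reviewer and covers half of the human-reported issues. An exact-match rule like this one is stricter than a human judgment of whether two comments describe the same issue, so it would likely understate the overlap on real data.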
