While prior work has examined the generation capabilities of agentic AI systems, little is known about how reviewers respond to AI-authored code in practice. In this paper, we present a large-scale empirical study of code review dynamics in agent-generated PRs. Using a curated subset of the AIDev dataset, we analyze 19,450 inline review comments spanning 3,177 agent-authored PRs from real-world GitHub repositories. We first derive a taxonomy of 12 review comment themes using topic modeling combined with large language model (LLM)-assisted semantic clustering and consolidation. Using this taxonomy, we then investigate whether zero-shot prompting of an LLM can reliably annotate review comments. Our evaluation against human annotations shows that an open-source LLM achieves reasonably high exact match (78.63%) and macro F1 score (0.78), along with substantial agreement with human annotators at the review comment level. At the PR level, the LLM also correctly identifies the dominant review theme with 78% Top-1 accuracy and achieves an average Jaccard similarity of 0.76, indicating strong alignment with human judgments. Applying this annotation pipeline at scale, we find that, beyond functional correctness and logical changes, reviews of agent-authored PRs predominantly focus on documentation gaps, refactoring needs, and styling and formatting issues, along with testing and security-related concerns. These findings suggest that while AI agents can accelerate code production, gaps remain that require targeted human review oversight.
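The PR-level agreement metrics above can be illustrated with a minimal sketch. The theme labels, per-PR counts, and helper names here are hypothetical examples, not values from the AIDev dataset; the sketch only shows how Top-1 theme agreement and Jaccard similarity over theme sets are computed.

```python
# Hedged sketch of the PR-level agreement metrics described above.
# All labels and counts below are illustrative, not from the study's data.

from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two theme sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def top1_match(llm_counts: Counter, human_counts: Counter) -> bool:
    """Whether the LLM's most frequent theme matches the human annotators' one."""
    return llm_counts.most_common(1)[0][0] == human_counts.most_common(1)[0][0]

# Hypothetical per-theme comment counts for a single PR
human = Counter({"documentation": 3, "refactoring": 2, "testing": 1})
llm   = Counter({"documentation": 3, "refactoring": 1, "style": 1})

print(top1_match(llm, human))                   # dominant theme agrees: True
print(round(jaccard(set(llm), set(human)), 2))  # theme-set overlap: 0.5
```

Averaging these two quantities over all annotated PRs yields the Top-1 accuracy and mean Jaccard similarity reported in the abstract.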