We conduct a large-scale empirical user study in a live setting to evaluate the acceptance of LLM-generated comments and their impact on the review process. The study was performed in two organizations: Mozilla (whose codebase is open source) and Ubisoft (fully closed-source). Inside their usual review environment, participants were given access to RevMate, an LLM-based assistive tool that suggests review comments generated by an off-the-shelf LLM with Retrieval-Augmented Generation (RAG) to provide extra code and review context, combined with an LLM-as-a-Judge component that auto-evaluates the generated comments and discards irrelevant ones. Based on more than 587 patch reviews provided by RevMate, we observed that 8.1% and 7.2% of LLM-generated comments were accepted by reviewers in the respective organizations, while a further 14.6% and 20.5% were still marked as valuable review or development tips. Refactoring-related comments are more likely to be accepted than functional comments (18.2% and 18.6% compared to 4.8% and 5.2%). The extra time reviewers spent inspecting generated comments, or editing accepted ones (36 of 119 accepted comments were edited), is reasonable, with an overall median of 43 s per patch. Accepted generated comments are as likely as human-written comments to yield future revisions of the reviewed patch (74% vs. 73% at chunk level).
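The generate-then-judge pipeline described above can be sketched as follows. This is a minimal illustration, not RevMate's implementation: the retrieval, generation, and judging functions are stubs standing in for prompted calls to an off-the-shelf LLM, and all names (`retrieve_context`, `judge_comment`, the scoring threshold) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    file: str
    line: int
    text: str

def retrieve_context(patch: str) -> str:
    # RAG step (stubbed): fetch related code and prior review
    # comments to ground the generator with extra context.
    return f"related code and prior reviews for: {patch[:30]}"

def generate_comments(patch: str, context: str) -> list[Comment]:
    # Generator stub: a real system would prompt the LLM with the
    # patch plus retrieved context and parse comments from the reply.
    return [
        Comment("main.py", 10, "Consider extracting this block into a helper."),
        Comment("main.py", 42, "lgtm"),  # low-value comment, should be filtered
    ]

def judge_comment(patch: str, comment: Comment) -> float:
    # LLM-as-a-Judge stub: score comment relevance in [0, 1].
    # Here, a crude word-count proxy stands in for the judge model.
    return 0.9 if len(comment.text.split()) > 3 else 0.1

def review(patch: str, threshold: float = 0.5) -> list[Comment]:
    # Full pipeline: retrieve context, generate candidate comments,
    # then keep only those the judge deems relevant enough to show.
    context = retrieve_context(patch)
    candidates = generate_comments(patch, context)
    return [c for c in candidates if judge_comment(patch, c) >= threshold]
```

The key design point is that the judge acts as a filter between generation and the reviewer, so only comments scoring above a relevance threshold ever surface in the review environment.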