During code reviews, an essential step in software quality assurance, reviewers face the difficult task of understanding and evaluating code changes to validate their quality and prevent faults from being introduced into the codebase. This is a tedious process in which the required effort depends heavily on the submitted code and on the experience of both the author and the reviewer, leading to median wait times for review feedback of 15-64 hours. In an initial user study carried out with 29 experts, we found that re-ordering the files changed by a patch within the review environment has the potential to improve review quality: compared to the alphanumeric ordering, more comments are written (+23%), and participants' file-level hot-spot precision and recall increase to 53% (+13%) and 28% (+8%), respectively. Hence, this paper aims to help code reviewers by predicting which files in a submitted patch (1) need to be commented, (2) need to be revised, or (3) are hot-spots (commented or revised). For these three prediction tasks, we evaluate two types of text embeddings (i.e., Bag-of-Words and Large Language Model encodings) and review process features (i.e., code size-based and history-based features). Our empirical study on three open-source and two industrial datasets shows that combining the code embeddings and review process features leads to better results than the state-of-the-art approach. For all tasks, F1-scores (median of 40-62%) are significantly better than the state of the art (+1 to +9%).
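To make the evaluation terms concrete, the following minimal sketch shows how file-level hot-spot precision, recall, and F1 could be computed for a single patch. It assumes only the definition given above (a hot-spot is a file that was commented or revised). The `FileOutcome` class, the `precision_recall_f1` helper, and the file paths are hypothetical names chosen for illustration and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileOutcome:
    """Observed review outcome for one file in a patch (hypothetical structure)."""
    path: str
    commented: bool
    revised: bool

    @property
    def is_hotspot(self) -> bool:
        # A hot-spot is a file that was commented or revised.
        return self.commented or self.revised


def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted set against the actual set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: four files changed by one patch.
outcomes = [
    FileOutcome("src/parser.py", commented=True, revised=False),
    FileOutcome("src/lexer.py", commented=False, revised=True),
    FileOutcome("src/cli.py", commented=True, revised=True),
    FileOutcome("docs/README.md", commented=False, revised=False),
]
actual_hotspots = {o.path for o in outcomes if o.is_hotspot}

# Files that a reviewer or a model flagged as hot-spots (illustrative).
predicted_hotspots = {"src/parser.py", "docs/README.md"}

precision, recall, f1 = precision_recall_f1(predicted_hotspots, actual_hotspots)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy patch, one of the two flagged files is a true hot-spot and one of the three true hot-spots is found, so precision is 0.50, recall is about 0.33, and F1 is 0.40. The same computation applies to the commented and revised tasks by swapping the label used to build the actual set.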