As large-scale software systems grow more complex, identifying every modification that a given change requires becomes difficult. Co-changed methods, that is, methods that are frequently modified together, are crucial for understanding software dependencies. However, existing approaches for detecting them often return large result sets with many false positives. Analyzing pull requests rather than individual commits gives a more comprehensive view of related changes and captures essential co-change relationships. To address these challenges, we propose a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. Experiments on 150 open-source Java projects, totaling 41.5 million lines of code and 634,216 pull requests, show that the Random Forest model outperforms the other models by 2.5 to 12.8 percent in NDCG@5. It also surpasses baselines such as file proximity, code clones, FCP2Vec, and StarCoder 2 by 4.7 to 537.5 percent. Models trained on longer historical data (90 to 180 days) perform consistently, whereas accuracy declines after 60 days, indicating that models should be retrained every two months. The approach gives development teams an effective tool for managing co-changed methods, handling dependencies, and maintaining software quality.
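For readers unfamiliar with the NDCG@5 metric reported above, the following is a minimal sketch of how it can be computed for one ranked list of candidate methods. It assumes binary relevance (1 if a candidate was actually co-changed in the pull request, 0 otherwise) and a linear gain; the abstract does not state which NDCG variant was used, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative, not taken from the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` lists the relevance of each candidate in ranked order
    (assumption: 1 = truly co-changed, 0 = not co-changed).
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking: co-changed methods at positions 1, 3, and 6.
score = ndcg_at_k([1, 0, 1, 0, 0, 1], k=5)
print(f"NDCG@5 = {score:.3f}")
```

A ranking that places every truly co-changed method first scores 1.0, and relevant methods pushed below position 5 lower the score because they still count toward the ideal ranking.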