Code review is a crucial process before deploying code to production, as it validates the code, provides suggestions for improvement, and identifies errors such as missed edge cases. In projects with regular production releases, the effort required for peer code reviews remains high. Consequently, there has been significant interest among software engineering (SE) researchers in automating the code review process. Previous research on code review automation has typically treated the task as three independent sub-tasks: review necessity prediction, review comment generation, and code refinement. Our study attempts to (i) leverage the relationships between the sub-tasks of code review automation by developing a multi-task model that addresses all of them in an integrated manner, and (ii) increase model robustness on unseen data through collaborative large language model (LLM) training with federated learning (FL), which preserves the proprietary nature of the code. The study explores five simple techniques for multi-task training: two sequential methods, one parallel method, and two cumulative methods. The results indicate that sequentially training a federated LLM (FedLLM) for our multi-task code review use case is less efficient in terms of time, computation, and performance metrics than training separate models for each task. Whereas sequential training suffers from catastrophic forgetting, cumulative fine-tuning for multi-task training performs better than training a separate model for each individual task. This study highlights the need for research focused on effective fine-tuning of multi-task FedLLMs for SE tasks.
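To make the distinction between the training schedules concrete, the sketch below illustrates, under assumed semantics, how sequential, parallel (mixed-data), and cumulative fine-tuning differ in the order in which they expose the three code review sub-tasks' data to the model. It is not the paper's implementation: `fine_tune` is a hypothetical stand-in for one FedLLM fine-tuning round, and the datasets are toy placeholders.

```python
# Illustrative sketch only (not the authors' code): data-ordering schedules for
# multi-task fine-tuning over the three code review sub-tasks.
import random

SUB_TASKS = [
    "review_necessity_prediction",
    "review_comment_generation",
    "code_refinement",
]

def fine_tune(model_state, examples, stage_label):
    """Hypothetical fine-tuning step: here it only records which data the
    model saw in this stage (a placeholder for a real federated update)."""
    model_state.append((stage_label, len(examples)))
    return model_state

def sequential_schedule(model_state, datasets):
    # Train on each sub-task one after another; later stages can overwrite
    # earlier-task knowledge (catastrophic forgetting).
    for task in SUB_TASKS:
        model_state = fine_tune(model_state, datasets[task], task)
    return model_state

def parallel_schedule(model_state, datasets):
    # Mix examples from all sub-tasks into one shuffled pool and fine-tune
    # once on the combined data.
    mixed = [(task, ex) for task in SUB_TASKS for ex in datasets[task]]
    random.shuffle(mixed)
    return fine_tune(model_state, mixed, "all_tasks_mixed")

def cumulative_schedule(model_state, datasets):
    # Grow the training pool task by task, always re-including data from
    # previously covered sub-tasks, which counters forgetting.
    pool = []
    for task in SUB_TASKS:
        pool.extend((task, ex) for ex in datasets[task])
        model_state = fine_tune(model_state, pool, f"cumulative_up_to_{task}")
    return model_state

if __name__ == "__main__":
    toy_datasets = {task: list(range(3)) for task in SUB_TASKS}
    print("sequential:", sequential_schedule([], toy_datasets))
    print("parallel:  ", parallel_schedule([], toy_datasets))
    print("cumulative:", cumulative_schedule([], toy_datasets))
```

The design difference this sketch highlights is that the cumulative schedule keeps re-exposing the model to earlier sub-tasks' data at every stage, which is consistent with the abstract's observation that cumulative fine-tuning avoids the catastrophic forgetting seen in sequential training.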