Testing is essential for ensuring the quality of software systems, yet maintaining test cases is challenging and costly. Test cases must be updated frequently to keep pace with the evolving system under test, and these updates are often complex and expensive. Further, broken test cases that go unrepaired degrade test suite quality and disrupt the software development process, wasting developers' time. To address this challenge, we present TaRGet (Test Repair GEneraTor), a novel approach that leverages pre-trained code language models for automated test case repair. TaRGet treats test repair as a language translation task, employing a two-step process to fine-tune a language model based on essential context data characterizing the test breakage. To evaluate our approach, we introduce TaRBench, a comprehensive benchmark we developed covering 45,373 broken test repairs across 59 open-source projects. Our results demonstrate TaRGet's effectiveness: it achieves an exact match accuracy of 66.1%. We further examine the effectiveness of TaRGet across different test repair scenarios and provide a practical guide for predicting situations in which the generated test repairs might be less reliable. We also explore whether project-specific data is always necessary for fine-tuning and whether our approach can be effective on new projects.
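For readers unfamiliar with the exact match metric reported above, the following is a minimal sketch of how such a score could be computed. It is not taken from the TaRGet implementation: the function names `normalize` and `exact_match_accuracy` are hypothetical, and the whitespace normalization is an assumption made for illustration (the abstract does not state how candidate repairs are compared with the ground truth).

```python
def normalize(code: str) -> str:
    """Collapse whitespace runs so pure formatting differences are ignored.

    Assumption for illustration only; the actual comparison rule used in the
    evaluation is not described in the abstract.
    """
    return " ".join(code.split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predicted repairs identical to the reference repair."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical toy data: the first repair differs only in spacing,
# the second differs in content.
predicted = ["assertEquals(2,  add(1, 1));", "assertTrue(isEmpty(list));"]
expected = ["assertEquals(2, add(1, 1));", "assertFalse(isEmpty(list));"]
score = exact_match_accuracy(predicted, expected)
```

Under this definition a repair counts as correct only if it reproduces the developer's repair token for token, so the metric is a conservative estimate: a semantically equivalent but textually different repair is scored as a miss.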