Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and found that LLMs frequently generate undesirable test smells -- up to 37% of the time. Then, we implemented a lightweight static analysis-based reward model and trained LLMs with this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23% and producing nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on all code quality metrics, despite being trained on a substantially cheaper Codex model. We provide insights into how to reliably utilize RL to improve test generation quality and show that RLSQM is a significant step toward enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.