Software testing is a crucial part of software development, and high-quality tests that adhere to best practices are essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). We first analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. We then train a specific reward model for each static quality metric and use Proximal Policy Optimization (PPO) to train models that optimize a single quality metric at a time. Furthermore, we combine these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how to reliably use RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generates higher-quality test cases than the base LLM, improving the model by up to 21%, and generates nearly 100% syntactically correct code. RLSQM also outperforms GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at this link: https://figshare.com/s/ded476c8d4c221222849.
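As an illustration of scoring a generated test against static quality metrics and merging the per-metric scores into one reward, the following is a minimal sketch under stated assumptions, not the RLSQM implementation. The three smell checks, the equal default weights, and the names `metric_rewards` and `combined_reward` are hypothetical and chosen only for illustration. In the approach described above the rewards come from trained reward models and feed PPO; here they are computed directly with Python's `ast` module, and no RL training is shown.

```python
import ast
from typing import Dict, Optional


def has_assertion(tree: ast.AST) -> bool:
    # A test without any assert statement checks nothing.
    return any(isinstance(n, ast.Assert) for n in ast.walk(tree))


def no_conditional_logic(tree: ast.AST) -> bool:
    # Branches and loops inside a test are treated as a smell (assumption).
    return not any(isinstance(n, (ast.If, ast.For, ast.While)) for n in ast.walk(tree))


def no_duplicate_assertions(tree: ast.AST) -> bool:
    # ast.dump omits line numbers by default, so identical asserts compare equal.
    dumps = [ast.dump(n) for n in ast.walk(tree) if isinstance(n, ast.Assert)]
    return len(dumps) == len(set(dumps))


# Hypothetical metric set; the metrics used in the paper may differ.
METRICS = {
    "has_assertion": has_assertion,
    "no_conditional_logic": no_conditional_logic,
    "no_duplicate_assertions": no_duplicate_assertions,
}


def metric_rewards(source: str) -> Dict[str, float]:
    """Return one reward in {0.0, 1.0} per static quality metric."""
    names = ["syntactically_correct"] + list(METRICS)
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable code fails every metric.
        return {name: 0.0 for name in names}
    rewards = {"syntactically_correct": 1.0}
    for name, check in METRICS.items():
        rewards[name] = 1.0 if check(tree) else 0.0
    return rewards


def combined_reward(source: str, weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of the per-metric rewards (equal weights by default)."""
    rewards = metric_rewards(source)
    weights = weights or {name: 1.0 for name in rewards}
    total = sum(weights[name] for name in rewards)
    return sum(weights[name] * rewards[name] for name in rewards) / total


clean_test = "def test_add():\n    assert add(1, 2) == 3\n"
smelly_test = (
    "def test_add():\n"
    "    if True:\n"
    "        assert add(1, 2) == 3\n"
    "        assert add(1, 2) == 3\n"
)
print(combined_reward(clean_test))
print(combined_reward(smelly_test))
```

Optimizing one metric at a time corresponds to passing a weight of 1.0 for that metric and 0.0 for the others, while the unified reward uses non-zero weights for several metrics at once.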