Software testing is a core discipline of software engineering and has produced a large body of research, notably on automatic test generation. Existing approaches, however, either produce test cases that can be qualified as simple (e.g., unit tests) or require precise specifications, so most testing procedures still rely on human-written test cases to form test suites. Such test suites are incomplete: they cover only parts of the project, or they are produced only after the bug has been fixed. Yet several research challenges, such as automatic program repair, as well as practitioner processes, build on the assumption that available test suites are sufficient. There is thus a need to break existing barriers in automatic test case generation. While prior work largely focused on random unit testing inputs, we propose generating test cases that realistically represent complex user execution scenarios that reveal buggy behaviour. Such scenarios are informally described in bug reports, which should therefore be considered natural inputs for specifying bug-triggering test cases. In this work, we investigate the feasibility of this generation by leveraging large language models (LLMs) with bug reports as inputs. Our experiments use ChatGPT, as an online service, and CodeGPT, a code-related pre-trained LLM that was fine-tuned for our task. Overall, we show experimentally that the bug reports associated with up to 50% of Defects4J bugs can prompt ChatGPT to generate an executable test case. We also show that even new bug reports can be used as input for generating executable test cases. Finally, we report experimental results confirming that LLM-generated test cases are immediately useful in software engineering tasks such as fault localization and patch validation in automated program repair.
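To make the idea of using a bug report as the input for test generation concrete, the following is a minimal sketch of the two steps around the LLM call: assembling a prompt from a bug report, and pulling a candidate test case out of the model's reply. It is not the prompt or pipeline used in this work. The `BugReport` fields, the prompt wording, and the helper names `build_prompt` and `extract_test_case` are all hypothetical, and no LLM service is actually called.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BugReport:
    """Hypothetical container for the fields of a bug report."""
    project: str
    title: str
    description: str


def build_prompt(report: BugReport) -> str:
    """Assemble an illustrative prompt (the wording is an assumption)."""
    return (
        f"Project: {report.project}\n"
        f"Bug report title: {report.title}\n"
        f"Bug report description:\n{report.description}\n\n"
        "Write a JUnit test case that reproduces the bug described above."
    )


# Built programmatically so this example contains no literal code fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:java)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_test_case(response: str) -> Optional[str]:
    """Return the first fenced code block in an LLM reply, or None if absent."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else None
```

In such a setup, the string returned by `build_prompt` would be sent to the model, and the text returned by `extract_test_case` would then be compiled and run against the buggy program to check whether it is executable.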