Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often written in natural language and are thus difficult to transform into test cases consistently. As a result, existing techniques have mostly focused on crash bugs, which are easier to detect and verify automatically. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests and applying a post-processing pipeline that automatically identifies promising generated tests, our proposed technique LIBRO successfully reproduces about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation of 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also show substantial potential: the StarCoder LLM achieves 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of its performance on a held-out bug dataset that is likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using LIBRO improves as LLM size increases, which provides guidance on which LLMs can be used with the LIBRO pipeline.
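The abstract does not say how the post-processing pipeline decides which generated tests are promising, so the following is only a minimal sketch of one plausible heuristic, not LIBRO's actual implementation. It assumes each generated test has already been run against the buggy program, keeps the tests that compile and fail, and ranks them by how many other samples fail with the same message. The `GeneratedTest` record, its fields, and `rank_candidates` are hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeneratedTest:
    """One LLM-generated test and its outcome on the buggy program (hypothetical record)."""
    name: str
    compiled: bool
    failure_message: Optional[str]  # None means the test passed on the buggy version


def rank_candidates(tests: List[GeneratedTest]) -> List[GeneratedTest]:
    """Keep tests that compile and fail on the buggy version, then rank them so
    that tests whose failure message is shared by more samples come first.

    Assumption: agreement between independently sampled tests is used as a
    proxy for "promising"; the abstract does not confirm this criterion.
    """
    failing = [t for t in tests if t.compiled and t.failure_message is not None]
    agreement = Counter(t.failure_message for t in failing)
    # sorted() is stable, so tests with equal agreement keep generation order.
    return sorted(failing, key=lambda t: -agreement[t.failure_message])


if __name__ == "__main__":
    samples = [
        GeneratedTest("t1", True, "AssertionError: expected 3 but was 2"),
        GeneratedTest("t2", True, None),   # passes on the buggy version: dropped
        GeneratedTest("t3", False, None),  # does not compile: dropped
        GeneratedTest("t4", True, "NullPointerException"),
        GeneratedTest("t5", True, "AssertionError: expected 3 but was 2"),
    ]
    print([t.name for t in rank_candidates(samples)])  # ['t1', 't5', 't4']
```

A test that fails on the buggy version is only a candidate: confirming that it actually reproduces the reported bug would additionally require checking that it passes once the bug is fixed, which is how a benchmark with paired buggy and fixed versions, such as Defects4J, can be used for evaluation.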