Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often written in natural language and are thus difficult to transform into test cases consistently. As a result, existing techniques have mostly focused on crash bugs, which are easier to detect and verify automatically. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests and applying a post-processing pipeline that automatically identifies promising generated tests, our proposed technique, LIBRO, successfully reproduces about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation of 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential: the StarCoder LLM achieves 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of its performance on a held-out bug dataset that is likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using LIBRO improves as LLM size increases, providing guidance on which LLMs can be used with the LIBRO pipeline.
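The abstract does not spell out how the post-processing pipeline identifies promising tests, so the following is only a minimal sketch of one plausible selection step, not the actual LIBRO implementation. It assumes that a generated test is a reproduction candidate only if it fails on the buggy program, and that candidates whose failure output is shared by many other generated tests are more promising. The function `rank_candidate_tests`, the callable `run_on_buggy`, and the parameter `min_agreement` are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple


def rank_candidate_tests(
    candidates: List[str],
    run_on_buggy: Callable[[str], Optional[str]],
    min_agreement: int = 1,
) -> List[Tuple[str, int]]:
    """Rank generated tests by how many candidates share their failure output.

    `run_on_buggy(test)` is an assumed hook: it returns the failure message
    if the test fails on the buggy program, or None if the test passes or
    cannot be executed.
    """
    # Group failing tests by the failure message they produce.
    groups: Dict[str, List[str]] = defaultdict(list)
    for test in candidates:
        failure = run_on_buggy(test)
        if failure is not None:  # a passing test cannot reproduce the bug
            groups[failure].append(test)

    # Keep one representative per group, scored by group size (agreement).
    ranked = [
        (tests[0], len(tests))
        for tests in groups.values()
        if len(tests) >= min_agreement
    ]
    # Larger groups first; ties keep their original generation order.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy usage with canned execution results instead of a real test runner.
fake_results = {
    "test_a": "AssertionError: expected 1 but was 0",
    "test_b": "AssertionError: expected 1 but was 0",
    "test_c": None,
    "test_d": "NullPointerException",
}
ranking = rank_candidate_tests(list(fake_results), fake_results.get)
```

In a real setting, `run_on_buggy` would compile and execute each generated test against the buggy revision, for example through the benchmark's own test harness. Raising `min_agreement` trades coverage for precision by discarding failure outputs that only a few generated tests agree on.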