Generating high-utility mutations for testing purposes is a key challenge in the mutation testing literature. Existing approaches rely either on human-specified syntactic rules or on learning-based techniques, both of which produce large numbers of redundant mutants. Large Language Models (LLMs) have shown great potential in code-related tasks, but their utility in mutation testing remains unexplored. To this end, we systematically investigate the performance of LLMs in generating effective mutations with respect to their usability, fault detection potential, and relationship with real bugs. In particular, we perform a large-scale empirical study involving four LLMs, including both open- and closed-source models, and 440 real bugs from two Java benchmarks. We find that, compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs, leading to approximately 18% higher fault detection (87% vs. 69%) on a newly collected set of bugs purposely selected for evaluating learning-based approaches, thereby mitigating potential data leakage concerns. Additionally, we explore alternative prompt engineering strategies and the root causes of uncompilable mutations produced by the LLMs, and provide valuable insights for the use of LLMs in the context of mutation testing.