It is natural to suppose that a Large Language Model (LLM) is more likely to generate correct test cases when prompted with correct code under test than with incorrect code under test. However, the size of this effect has not previously been measured, despite its obvious importance for both practicing software engineers and researchers. To answer this question, we conducted a comprehensive empirical study of 5 open source and 6 closed source language models on 3 widely-used benchmark data sets, together with 41 repo-level examples drawn from two different real-world data sets. Our results reveal that, compared to prompting with incorrect code under test, LLMs prompted with correct code achieve improvements of 57\% in test accuracy, 12\% in code coverage, and 24\% in bug detection. We further show that these conclusions carry over from the three benchmark data sets to real-world code, where tests generated for incorrect code suffer a 47\% worse bug detection rate. Finally, we report that providing natural language code descriptions yields improvements of +18\% in accuracy, +4\% in coverage, and +34\% in bug detection. These findings have actionable consequences. For example, the 47\% reduction in real-world bug detection is a clear concern. Fortunately, it is one for which our findings on the added value of descriptions offer an immediately actionable remedy.