Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases, whose inputs trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behaviors. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version, and ChatGPT is weak at recognizing such differences when the two versions have similar syntax. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be leveraged to synthesize programs, which makes the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
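To make the differential-testing idea concrete, the following is a minimal Python sketch rather than the implementation evaluated in the paper. It assumes that several reference versions have already been synthesized from the inferred intended behavior; here they are hand-written stand-ins (`reference_a`, `reference_b`, `reference_c`), and `buggy_max` is a hypothetical program under test, not a QuixBugs program. Using a majority vote among the references as the oracle is also an assumption made for illustration.

```python
import random
from collections import Counter


def buggy_max(xs):
    """Hypothetical program under test: the accumulator wrongly starts at 0."""
    best = 0  # bug: should start at xs[0]
    for x in xs:
        if x > best:
            best = x
    return best


# Stand-ins for programs synthesized from the inferred intended behavior
# ("return the largest element of a non-empty list").
def reference_a(xs):
    return max(xs)


def reference_b(xs):
    return sorted(xs)[-1]


def reference_c(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


def find_failure_inducing_tests(program, references, inputs):
    """Return (input, expected, actual) triples where the program disagrees
    with the output that a strict majority of the references agree on."""
    failures = []
    for test_input in inputs:
        # Outputs must be hashable so that they can be counted.
        outputs = [ref(list(test_input)) for ref in references]
        expected, votes = Counter(outputs).most_common(1)[0]
        if votes <= len(references) // 2:
            continue  # no strict majority, so no trustworthy oracle
        actual = program(list(test_input))
        if actual != expected:
            failures.append((test_input, expected, actual))
    return failures


def random_inputs(count, seed=0):
    """Generate small non-empty integer lists as candidate test inputs."""
    rng = random.Random(seed)
    return [
        [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        for _ in range(count)
    ]


if __name__ == "__main__":
    refs = [reference_a, reference_b, reference_c]
    for test_input, expected, actual in find_failure_inducing_tests(
        buggy_max, refs, random_inputs(200)
    ):
        print(f"input={test_input} expected={expected} actual={actual}")
```

In this sketch, an input counts as failure-inducing only when the references agree among themselves and the program under test disagrees with them, so the agreed output doubles as the expected value of the test case.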