Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases, whose inputs trigger the software's fault, and constructing an automated oracle that detects the software's incorrect behavior. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version, and ChatGPT is weak at recognizing such differences when the two versions have similar syntax. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We also observe that ChatGPT is effective at inferring the intended behavior of a buggy program. This intended behavior can be leveraged to synthesize programs, which makes the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
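To make the differential-testing idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that reference programs have already been synthesized from the inferred intended behavior; here they are hand-written stand-ins (`reference_a`, `reference_b`), and `buggy_max`, `CANDIDATE_INPUTS`, and `find_failure_inducing_tests` are hypothetical names introduced only for illustration. An input is reported as failure-inducing when all reference programs agree on an output and the program under test disagrees with it.

```python
from typing import Any, Callable, List, Sequence, Tuple


def buggy_max(xs: List[int]) -> int:
    # Hypothetical program under test: the accumulator starts at 0
    # instead of xs[0], so all-negative inputs give a wrong answer.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def reference_a(xs: List[int]) -> int:
    # Stand-in for a program synthesized from the inferred intended behavior.
    return max(xs)


def reference_b(xs: List[int]) -> int:
    # A second, independently written stand-in for a synthesized program.
    return sorted(xs)[-1]


def find_failure_inducing_tests(
    program: Callable[[List[int]], Any],
    references: Sequence[Callable[[List[int]], Any]],
    candidate_inputs: Sequence[List[int]],
) -> List[Tuple[List[int], Any, Any]]:
    """Return (input, expected, actual) for inputs where the program
    disagrees with the unanimous output of the reference programs."""
    found = []
    for test_input in candidate_inputs:
        # Pass a copy to each callee so none can mutate the shared input.
        expected_values = [ref(list(test_input)) for ref in references]
        # Use the references as an oracle only when they all agree.
        if any(value != expected_values[0] for value in expected_values[1:]):
            continue
        actual = program(list(test_input))
        if actual != expected_values[0]:
            found.append((test_input, expected_values[0], actual))
    return found


# Hand-picked candidate inputs (an assumption; a real pipeline would generate them).
CANDIDATE_INPUTS = [[1, 2, 3], [5], [-4, -2, -9], [0, -1]]

if __name__ == "__main__":
    failures = find_failure_inducing_tests(
        buggy_max, [reference_a, reference_b], CANDIDATE_INPUTS
    )
    for test_input, expected, actual in failures:
        print(f"input={test_input} expected={expected} actual={actual}")
```

Requiring the references to agree before trusting them is one simple way to reduce false alarms caused by an incorrect synthesized program. It is a design choice of this sketch, not a detail stated in the abstract.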