Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases, whose inputs trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behavior. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version. When the two versions have similar syntax, ChatGPT is weak at recognizing these differences. Our insight is that ChatGPT's performance can be substantially improved when it is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be used to synthesize programs, which makes the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
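To make the differential-testing idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reference versions have already been synthesized from the inferred intended behavior; here `reference_a` and `reference_b` are hand-written stand-ins for such LLM-generated programs, and `program_under_test`, `find_failure_inducing`, and the candidate inputs are hypothetical names and data chosen for illustration. Agreement among the reference versions serves as the oracle, and any input on which the program under test disagrees is reported as failure-inducing.

```python
from collections import Counter


def program_under_test(xs, lo, hi):
    """Hypothetical buggy program: should count values in the closed range [lo, hi]."""
    # Subtle bug: the upper bound is treated as exclusive.
    return sum(1 for x in xs if lo <= x < hi)


def reference_a(xs, lo, hi):
    """Stand-in for a program synthesized from the inferred intended behavior."""
    return sum(1 for x in xs if lo <= x <= hi)


def reference_b(xs, lo, hi):
    """A second, independently written stand-in reference version."""
    return len([x for x in xs if not (x < lo or x > hi)])


def find_failure_inducing(put, references, candidate_inputs):
    """Return (args, expected, actual) for inputs where `put` deviates from the references."""
    failures = []
    for args in candidate_inputs:
        outputs = [ref(*args) for ref in references]
        expected, votes = Counter(outputs).most_common(1)[0]
        # Skip inputs on which the references have no strict majority:
        # without agreement there is no trustworthy oracle.
        if votes * 2 <= len(references):
            continue
        actual = put(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Hypothetical candidate inputs; in practice these would be generated automatically.
candidate_inputs = [
    ([1, 2, 3], 1, 3),
    ([5, 6, 7], 0, 4),
    ([4, 4, 9], 4, 8),
]

failures = find_failure_inducing(
    program_under_test, [reference_a, reference_b], candidate_inputs
)
for args, expected, actual in failures:
    print(f"input={args} expected={expected} actual={actual}")
```

Only the first input reaches the boundary value `hi`, so it is the only one that exposes the off-by-one bug; the other two inputs produce the same output in all versions and are not reported.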