Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases, whose inputs trigger the software's fault, and constructing an automated oracle that detects the software's incorrect behavior. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version, and ChatGPT is weak at recognizing such differences when the two versions have similar syntax. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The inferred intended behavior can be used to synthesize programs, making the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
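The differential-testing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `buggy_max`, `reference_a`, and `reference_b` are hypothetical stand-ins, with the two reference functions playing the role of programs that would be synthesized from the inferred intended behavior. An input is reported as failure-inducing only when all references agree with each other and the program under test disagrees with them; the agreed reference output then serves as the oracle.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple

Result = Tuple[str, Any]  # ("ok", value) or ("error", exception name)


def buggy_max(xs: Sequence[int]) -> int:
    # Hypothetical program under test: starting from 0 makes it wrong
    # whenever every element is negative.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def reference_a(xs: Sequence[int]) -> int:
    # Hypothetical reference version (stands in for a synthesized program).
    return max(xs)


def reference_b(xs: Sequence[int]) -> int:
    # A second, independently written hypothetical reference version.
    return sorted(xs)[-1]


def run(program: Callable[[Any], Any], test_input: Any) -> Result:
    # Treat exceptions as observable behavior so they can be compared too.
    try:
        return ("ok", program(test_input))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_failure_inducing(
    program: Callable[[Any], Any],
    references: List[Callable[[Any], Any]],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Result, Result]]:
    failures = []
    for test_input in inputs:
        ref_results = [run(ref, test_input) for ref in references]
        expected = ref_results[0]
        # Skip inputs on which the references disagree: no trustworthy oracle.
        if any(result != expected for result in ref_results):
            continue
        observed = run(program, test_input)
        if observed != expected:
            failures.append((test_input, observed, expected))
    return failures


candidate_inputs = [[1, 2, 3], [-5, -2, -9], [0], [4, 4]]
failures = find_failure_inducing(
    buggy_max, [reference_a, reference_b], candidate_inputs
)
for test_input, observed, expected in failures:
    print(f"input={test_input} observed={observed} expected={expected}")
```

Requiring agreement among several references is one simple way to reduce false alarms when a reference program may itself be wrong; it is a design choice of this sketch, not a claim about the paper's method.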