Finding Failure-Inducing Test Cases with ChatGPT

Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases, whose inputs trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behavior. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version. When the two versions have similar syntax, ChatGPT is weak at recognizing these differences. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be leveraged to synthesize programs, which makes the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
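The differential-testing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `buggy_abs_diff` is a hypothetical program under test, `reference_abs_diff` is a hypothetical stand-in for a program synthesized from the inferred intended behavior, and the candidate inputs are hand-written here rather than generated by an LLM. An input is reported as failure-inducing when the two programs disagree on it, so the reference program also serves as the oracle.

```python
from typing import Any, Callable, List, Sequence, Tuple


def buggy_abs_diff(a: int, b: int) -> int:
    # Hypothetical program under test: the absolute value is missing,
    # so the result is wrong whenever a < b.
    return a - b


def reference_abs_diff(a: int, b: int) -> int:
    # Hypothetical stand-in for a program synthesized from the
    # inferred intended behavior ("return the absolute difference").
    return abs(a - b)


def run(func: Callable[..., Any], args: Tuple[Any, ...]) -> Tuple[str, Any]:
    # Capture either the return value or the exception type, so that
    # crashes can be compared in the same way as ordinary outputs.
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_failure_inducing(
    program_under_test: Callable[..., Any],
    reference: Callable[..., Any],
    candidate_inputs: Sequence[Tuple[Any, ...]],
) -> List[Tuple[Tuple[Any, ...], Tuple[str, Any], Tuple[str, Any]]]:
    # Differential testing: run both programs on the same inputs and
    # keep every input on which their observable behavior differs.
    failures = []
    for args in candidate_inputs:
        actual = run(program_under_test, args)
        expected = run(reference, args)
        if actual != expected:
            failures.append((args, actual, expected))
    return failures


# Hand-written candidate inputs (an assumption for this sketch).
candidates = [(3, 1), (1, 3), (0, 0), (-2, 5)]
failures = find_failure_inducing(buggy_abs_diff, reference_abs_diff, candidates)

for args, actual, expected in failures:
    print(f"input={args} actual={actual} expected={expected}")
```

In this sketch a disagreement only signals a failure if the reference program is itself correct, which is why the quality of the inferred intended behavior matters.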

