Automated Bug Generation in the era of Large Language Models

Bugs are essential in software engineering; over the past decades, many techniques have been proposed to detect, localize, and repair bugs in software systems. Evaluating the effectiveness of such techniques requires complex bugs, i.e., those that are hard to detect through testing and hard to repair through debugging. From the classic software engineering point of view, a hard-to-repair bug differs from the correct code in multiple locations, making it hard to localize and repair. Hard-to-detect bugs, on the other hand, manifest themselves only under specific test inputs and reachability conditions. These two objectives, i.e., generating hard-to-detect and hard-to-repair bugs, are mostly aligned: a bug generation technique can change multiple statements so that they are covered only under a specific set of inputs. However, the two objectives conflict for learning-based techniques. To challenge a bug prediction model, a bug should have a code representation similar to that of the correct code in the training data, so that the model struggles to distinguish the two. The definition of a hard-to-repair bug remains the same, but with a caveat: the more a bug differs from the original code (at multiple locations), the more distant their representations are and the easier the bug is to detect. We propose BugFarm, which transforms arbitrary code into multiple complex bugs. BugFarm leverages LLMs to mutate code in multiple locations (hard-to-repair). To ensure that multiple modifications do not notably change the code representation, BugFarm analyzes the attention of the underlying model and instructs LLMs to change only the least attended locations (hard-to-detect). Our comprehensive evaluation of 320k+ bugs from over 2.5M mutants generated by BugFarm and two alternative approaches demonstrates BugFarm's superiority in generating bugs that are hard to detect by learning-based bug prediction approaches and hard to repair by a state-of-the-art learning-based program repair technique.
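To make the attention-guided step more concrete, the following is a minimal sketch of how the least attended locations might be selected and passed to an LLM. It is an illustration under stated assumptions, not BugFarm's actual implementation: the function names `least_attended_lines` and `build_prompt`, the use of a mean over per-token attention scores to rank statements, the cutoff `k`, and the prompt wording are all hypothetical, and the attention scores are assumed to have been extracted from the underlying model beforehand.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def least_attended_lines(
    token_lines: Sequence[int],
    token_attention: Sequence[float],
    k: int,
) -> List[int]:
    """Return the k line numbers with the lowest mean token attention.

    token_lines[i] is the source line of token i and token_attention[i] is
    its attention score (assumed to be precomputed from the model).
    Averaging per line is an assumption made for this sketch.
    """
    if len(token_lines) != len(token_attention):
        raise ValueError("token_lines and token_attention must have equal length")

    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for line, score in zip(token_lines, token_attention):
        totals[line] += score
        counts[line] += 1

    mean_by_line = {line: totals[line] / counts[line] for line in totals}
    # Rank by mean attention; break ties by line number for determinism.
    ranked = sorted(mean_by_line, key=lambda line: (mean_by_line[line], line))
    return sorted(ranked[:k])


def build_prompt(code: str, target_lines: Sequence[int]) -> str:
    """Build a hypothetical instruction restricting edits to target_lines."""
    numbered = "\n".join(
        f"{number}: {text}" for number, text in enumerate(code.splitlines(), start=1)
    )
    allowed = ", ".join(str(line) for line in target_lines)
    return (
        "Inject a bug into the code below by changing ONLY these lines: "
        f"{allowed}.\n\n{numbered}"
    )


if __name__ == "__main__":
    code = "def add(a, b):\n    total = a + b\n    check = total >= 0\n    return total"
    lines = [1, 1, 2, 2, 3, 3, 4]
    attention = [0.30, 0.20, 0.05, 0.05, 0.10, 0.20, 0.10]
    targets = least_attended_lines(lines, attention, k=2)
    print(targets)
    print(build_prompt(code, targets))
```

Restricting the mutation to several low-attention lines reflects both goals described above: multiple changed locations (hard-to-repair) that have little influence on the model's representation of the code (hard-to-detect).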
