Deep learning (DL) frameworks have served as fundamental components of DL systems over the last decade. However, bugs in DL frameworks can lead to catastrophic consequences in critical scenarios. A simple yet effective way to find bugs in DL frameworks is fuzz testing (fuzzing). Existing approaches focus on test generation, leaving execution results with high semantic value (e.g., coverage information, bug reports, and exception logs) unexploited, even though they can serve as multiple types of feedback. To fill this gap, we propose FUEL, which effectively utilizes such feedback information through two Large Language Models (LLMs): an analysis LLM and a generation LLM. Specifically, the analysis LLM infers analysis summaries from the feedback information, while the generation LLM creates tests guided by these summaries. Furthermore, building on this multi-feedback guidance, we design two additional components: (i) a feedback-aware simulated annealing algorithm that selects operators for test generation, enriching test diversity; and (ii) a program self-repair strategy that automatically repairs invalid tests, enhancing test validity. We evaluate FUEL on the two most popular DL frameworks, and the experimental results show that FUEL improves the line coverage of PyTorch and TensorFlow by 4.48% and 9.14%, respectively, over four state-of-the-art baselines. At the time of submission, FUEL had detected 104 previously unknown bugs in PyTorch and TensorFlow, with 93 confirmed as new bugs and 53 already fixed. Fourteen vulnerabilities have been assigned CVE IDs, seven of which are rated as high-severity with a CVSS score of "7.5 HIGH". Our artifact is available at https://github.com/NJU-iSE/FUEL.
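To illustrate the flavor of the feedback-aware simulated annealing component, the sketch below shows one possible way feedback (e.g., coverage gain or a triggered bug) could bias operator selection. This is a minimal, hypothetical sketch only: the operator names, the scoring model, and the cooling schedule are illustrative assumptions, not FUEL's actual implementation.

```python
import math
import random

def select_operator(scores, temperature, rng=random):
    """Simulated-annealing-style pick: usually the best-scoring operator,
    but accept a worse (exploratory) one with Boltzmann probability."""
    best = max(scores, key=scores.get)
    candidate = rng.choice(list(scores))
    if candidate == best:
        return best
    delta = scores[best] - scores[candidate]  # how much worse the candidate is
    if rng.random() < math.exp(-delta / max(temperature, 1e-9)):
        return candidate  # exploration, more likely at high temperature
    return best           # exploitation

def update_score(scores, op, coverage_gain, bug_found):
    """Feedback step: reward operators whose tests raised coverage or hit bugs.
    The reward weights here are arbitrary placeholders."""
    scores[op] += coverage_gain + (10.0 if bug_found else 0.0)

# Usage: anneal the temperature over fuzzing iterations.
scores = {"conv2d": 1.0, "matmul": 1.0, "softmax": 1.0}  # hypothetical operators
temperature = 5.0
for _ in range(100):
    op = select_operator(scores, temperature)
    # ... generate and run a test exercising `op`, collect execution feedback ...
    update_score(scores, op, coverage_gain=0.1, bug_found=False)
    temperature *= 0.95  # cooling: shift from exploration toward exploitation
```

As the temperature decays, selection concentrates on operators whose past tests yielded the most feedback reward, while early iterations still sample the full operator pool.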