Fuzzing Deep-Learning Libraries via Large Language Models

Detecting bugs in Deep Learning (DL) libraries (e.g., TensorFlow and PyTorch) is critical for ensuring the effectiveness and safety of almost all downstream DL systems for end users. However, traditional fuzzing techniques are hardly effective in this challenging domain, because input DL programs must satisfy both the syntax/semantics of the input language (e.g., Python) and the DL API input/shape constraints for tensor computations. To address these limitations, we propose TitanFuzz, the first approach to directly leverage Large Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are titanic models trained on billions of code snippets and can autoregressively generate human-like code snippets. Our key insight is that the training corpora of modern LLMs include numerous code snippets invoking DL library APIs, so the models can implicitly learn both language syntax/semantics and intricate DL API constraints for valid DL program generation. More specifically, we use both generative and infilling LLMs (e.g., Codex and InCoder) to generate and mutate valid, diverse input DL programs for fuzzing. Our experimental results demonstrate that TitanFuzz achieves 30.38%/50.84% higher code coverage than state-of-the-art fuzzers on TensorFlow/PyTorch. Furthermore, TitanFuzz detects 65 bugs, 41 of which have already been confirmed as previously unknown. This paper demonstrates that modern titanic LLMs can be leveraged to directly perform both generation-based and mutation-based fuzzing, which have been studied for decades, while being fully automated, generalizable, and applicable to domains that are challenging for traditional approaches (such as DL systems). We hope TitanFuzz can stimulate more work in this promising direction of LLMs for fuzzing.
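
To make the mutation-based half of this idea concrete, the following is a minimal sketch, not TitanFuzz's actual implementation. It masks one line of a seed DL program, fills the gap, and keeps only children that still parse as Python. Everything named here is an assumption for illustration: the `<SPAN>` token, the helpers `mask_line`, `stub_infill`, and `fuzz`, and above all the fixed candidate pool, which stands in for a real infilling LLM such as InCoder. The sketch only checks syntax; a real fuzzer would also execute each program against the DL library.

```python
import ast
import random

MASK = "<SPAN>"  # hypothetical mask token; real infilling models define their own

# Seed program, as a generative LLM might produce it. It is only parsed here,
# never executed, so torch does not need to be installed.
SEED = """import torch
x = torch.rand(2, 3)
y = torch.nn.functional.relu(x)
"""


def mask_line(program: str, rng: random.Random) -> str:
    """Replace one randomly chosen non-import line with the mask token."""
    lines = program.splitlines()
    idx = rng.randrange(1, len(lines))  # line 0 is the import; keep it
    lines[idx] = MASK
    return "\n".join(lines) + "\n"


def stub_infill(masked: str, rng: random.Random) -> str:
    """Stand-in for an infilling LLM: fill the mask from a fixed pool."""
    candidates = [
        "x = torch.ones(4, 4)",
        "y = torch.tanh(x)",
        "y = torch.relu(x",  # deliberately malformed, to exercise the filter
    ]
    return masked.replace(MASK, rng.choice(candidates))


def is_valid_python(program: str) -> bool:
    """Cheap validity filter: does the program parse?"""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False


def fuzz(seed: str, rounds: int, rng: random.Random) -> list[str]:
    """Grow a pool of distinct, syntactically valid programs from one seed."""
    pool = [seed]
    for _ in range(rounds):
        parent = rng.choice(pool)
        child = stub_infill(mask_line(parent, rng), rng)
        if is_valid_python(child) and child not in pool:
            pool.append(child)
    return pool


if __name__ == "__main__":
    for program in fuzz(SEED, rounds=20, rng=random.Random(0)):
        print(program)
        print("-" * 20)
```

Swapping `stub_infill` for a call to an actual infilling model would leave the loop unchanged; the parse check is the cheapest possible filter and says nothing about whether the tensor shapes in a child program are compatible.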
