Fuzzing Deep-Learning Libraries via Large Language Models

Detecting bugs in Deep Learning (DL) libraries is critical for ensuring the effectiveness and safety of almost all downstream DL systems and their end users. Researchers have therefore started developing various fuzzing and testing techniques that target DL libraries. Previous work falls mainly into two categories: API-level fuzzing and model-level fuzzing. However, neither category can detect bugs that are exposed only by complex API sequences: API-level fuzzers cannot cover API sequences, while model-level fuzzers cover only specific API sequence patterns and a small subset of APIs because of the complicated input/shape constraints of tensor computations. To address these limitations, we propose LLMFuzz, the first automated approach that directly leverages Large Pre-trained Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are trained on billions of code snippets and can autoregressively generate human-like code. Our key insight is that the training corpora of modern LLMs also include numerous code snippets that invoke DL library APIs, so these models can implicitly learn the intricate DL API constraints and directly generate and mutate valid DL programs for fuzzing DL libraries. More specifically, we first use a generative LLM (e.g., Codex) to generate high-quality seed programs from input prompts. We then run an evolutionary fuzzing loop in which an infilling LLM (e.g., InCoder) applies small mutations to the seed programs, producing more diverse API sequences for fuzzing DL libraries. Our experimental results on popular DL libraries demonstrate that LLMFuzz covers 91.11% / 24.09% more APIs and achieves 30.38% / 50.84% higher code coverage than state-of-the-art fuzzers on TensorFlow / PyTorch. Furthermore, LLMFuzz detects 65 bugs, with 41 already confirmed as previously unknown bugs.
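The two-stage workflow described above (seed generation followed by an evolutionary loop of infilling mutations) can be illustrated with a minimal sketch. This is not the LLMFuzz implementation: the abstract does not specify the mutation operators, the seed selection strategy, or the bug oracle, so none of them is modelled here. The infilling model is replaced by a random stub, the seed is a plain Python program rather than a DL program, and all names (`MASK`, `mask_one_line`, `is_valid`, `evolve`, `make_stub_infill`) are hypothetical.

```python
import random
from typing import Callable, List

MASK = "<MASK>"  # hypothetical placeholder, not the infilling token of any real model


def mask_one_line(program: str, rng: random.Random) -> str:
    """Replace one randomly chosen line of the program with the mask token."""
    lines = program.splitlines()
    if not lines:
        return MASK
    lines[rng.randrange(len(lines))] = MASK
    return "\n".join(lines)


def is_valid(program: str) -> bool:
    """Cheap validity filter: keep only candidates that parse as Python."""
    try:
        compile(program, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def evolve(seeds: List[str], infill: Callable[[str], str],
           budget: int, rng: random.Random) -> List[str]:
    """Grow a pool of programs by repeatedly masking and infilling a parent."""
    pool = list(seeds)
    for _ in range(budget):
        parent = rng.choice(pool)                  # uniform choice (assumption)
        child = infill(mask_one_line(parent, rng))
        if child not in pool and is_valid(child):  # drop duplicates and broken code
            pool.append(child)
    return pool


def make_stub_infill(rng: random.Random) -> Callable[[str], str]:
    """Stand-in for an infilling LLM: fills the mask with a canned line."""
    candidates = ["y = x + 1", "y = abs(x)", "y = x * (", "y = -x"]

    def infill(masked: str) -> str:
        return masked.replace(MASK, rng.choice(candidates))

    return infill


# In the described approach the seed would come from a generative LLM;
# here it is written by hand.
seed = "x = 3\ny = x * 2\nprint(y)"
rng = random.Random(0)
pool = evolve([seed], make_stub_infill(rng), budget=20, rng=rng)
print(len(pool), "programs in pool")
```

In a real setting, `infill` would call an infilling model, and each program in the pool would additionally be executed against the DL library under test to look for crashes or inconsistent results; the syntax check above only stands in for that step.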
