Detecting bugs in Deep Learning (DL) libraries (e.g., TensorFlow/PyTorch) is critical for almost all downstream DL systems in ensuring effectiveness and safety for end users. However, traditional fuzzing techniques are hardly effective in this challenging domain, because the input DL programs must satisfy both the syntax/semantics of the input language (e.g., Python) and the DL API input/shape constraints for tensor computations. To address these limitations, we propose TitanFuzz, the first approach to directly leverage Large Language Models (LLMs) to generate input programs for fuzzing DL libraries. LLMs are titanic models trained on billions of code snippets and can autoregressively generate human-like code. Our key insight is that the training corpora of modern LLMs can include numerous code snippets invoking DL library APIs, so the models can implicitly learn both language syntax/semantics and intricate DL API constraints for valid DL program generation. More specifically, we use both generative and infilling LLMs (e.g., Codex/InCoder) to generate and mutate valid and diverse input DL programs for fuzzing. Our experimental results demonstrate that TitanFuzz achieves 30.38%/50.84% higher code coverage than state-of-the-art fuzzers on TensorFlow/PyTorch. Furthermore, TitanFuzz detects 65 bugs, 41 of which have already been confirmed as previously unknown. This paper demonstrates that modern titanic LLMs can be leveraged to directly perform both generation-based and mutation-based fuzzing, which have been studied for decades, while being fully automated, generalizable, and applicable to domains that are challenging for traditional approaches (such as DL systems). We hope TitanFuzz can stimulate more work in this promising direction of LLMs for fuzzing.
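To make the generate-and-mutate idea concrete, the following is a minimal sketch of a mutation loop driven by an infilling model. It is not the TitanFuzz implementation: the names `MASK`, `mask_argument`, `stub_infill`, `is_valid`, and `fuzz_loop` are hypothetical, the infilling LLM is replaced by a stub that samples from a fixed candidate list, and the seed is a plain Python program rather than a TensorFlow/PyTorch one so that the example runs without any DL library installed.

```python
import random
import re

MASK = "<SPAN>"  # hypothetical mask token; real infilling models define their own

# Hypothetical stand-in for an infilling LLM: a fixed pool of candidate fills.
CANDIDATE_FILLS = ["[1, 2, 3]", "[]", "'abc'", "None"]


def mask_argument(program: str, rng: random.Random) -> str:
    """Replace the arguments of one randomly chosen (non-nested) call with MASK."""
    matches = list(re.finditer(r"\(([^()]*)\)", program))
    if not matches:
        return program  # nothing to mask
    m = rng.choice(matches)
    return program[:m.start(1)] + MASK + program[m.end(1):]


def stub_infill(masked_program: str, rng: random.Random) -> str:
    """Stub for the infilling model: pick a fill for the masked span."""
    return rng.choice(CANDIDATE_FILLS)


def is_valid(program: str) -> bool:
    """Run the program and report whether it finished without an exception.

    A real fuzzer would execute candidates in an isolated subprocess.
    """
    try:
        exec(program, {})
        return True
    except Exception:
        return False


def fuzz_loop(seed: str, rounds: int, rng: random.Random) -> list[str]:
    """Grow a pool of valid, distinct programs by mask-and-infill mutation."""
    pool = [seed]
    for _ in range(rounds):
        parent = rng.choice(pool)
        masked = mask_argument(parent, rng)
        child = masked.replace(MASK, stub_infill(masked, rng))
        if child not in pool and is_valid(child):
            pool.append(child)
    return pool


SEED = "x = sorted([3, 1, 2])\ny = len(x)\n"

if __name__ == "__main__":
    for program in fuzz_loop(SEED, rounds=20, rng=random.Random(0)):
        print(program)
```

In this sketch a mutant such as `sorted(None)` raises an exception and is discarded, while a mutant such as `sorted('abc')` is kept and can itself be mutated in later rounds. With a real infilling model in place of `stub_infill`, the fills would be conditioned on the surrounding code rather than drawn from a fixed list.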