Deep learning (DL) libraries are widely used in critical applications, where even subtle silent bugs can lead to serious consequences. While existing DL fuzzing techniques have made progress in detecting crashes, they inherently struggle to detect silent bugs due to the lack of effective test programs and corresponding oracles. Building on the observation that historical bug reports contain rich, underutilized information about silent bugs, we leverage large language models (LLMs) to perform versatile yet controlled bug transfer for silent bug fuzzing. Specifically, our approach uses LLMs to extract context-aware bug patterns from historical issues, match semantically related Application Programming Interfaces (APIs) using functionality-based embeddings, and synthesize test cases with customized oracles. This enables proactive detection of silent bugs by transferring high-risk contexts and oracle designs from known buggy APIs to functionally similar target APIs. To ensure the reliability of our context-aware bug transfer, we introduce an LLM-powered self-validation module that systematically evaluates the validity of each transferred bug instance. We implement this methodology in a tool named TransFuzz and evaluate it on three mainstream DL libraries: PyTorch, TensorFlow, and MindSpore. TransFuzz discovers 79 previously unknown bugs across 10 bug types (12 confirmed as Common Vulnerabilities and Exposures, or CVEs), demonstrating its effectiveness and generalizability in transferring bug-discovery capabilities across DL libraries.
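The API-matching step described above can be illustrated with a minimal sketch. This toy example uses a bag-of-words cosine similarity as a stand-in for the paper's learned functionality-based embeddings, and the API descriptions below are hypothetical, not taken from TransFuzz itself: given a known-buggy API, functionally similar target APIs are ranked as candidate transfer targets.

```python
import math
from collections import Counter

# Hypothetical API functionality descriptions (illustrative only; TransFuzz
# derives these from documentation and uses learned embeddings instead).
API_DOCS = {
    "torch.nn.functional.max_pool2d": "applies 2d max pooling over an input signal",
    "tensorflow.nn.max_pool2d": "performs 2d max pooling on the input tensor",
    "torch.nn.functional.relu": "applies the rectified linear unit function elementwise",
}

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a semantic model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_similar_apis(buggy_api, top_k=1):
    # Rank every other API by functional similarity to the known-buggy one;
    # the top matches become targets for transferring bug contexts and oracles.
    query = embed(API_DOCS[buggy_api])
    scored = [
        (other, cosine(query, embed(doc)))
        for other, doc in API_DOCS.items()
        if other != buggy_api
    ]
    return sorted(scored, key=lambda t: -t[1])[:top_k]

print(match_similar_apis("torch.nn.functional.max_pool2d"))
# → [('tensorflow.nn.max_pool2d', 0.5)]
```

The pooling APIs share far more functional vocabulary than the unrelated activation function, so the cross-library counterpart ranks first, mirroring how a bug context found in one library's pooling operator would be transferred to its functional twin in another library.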