How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation

Fuzz driver generation based on Large Language Models (LLMs) is a promising research area. Unlike traditional methods based on program analysis, this text-based approach is more general, can harness a variety of API usage information, and produces code that is easy for humans to read. However, the fundamental issues in this direction, such as its effectiveness and potential challenges, are still poorly understood. To bridge this gap, we conducted the first in-depth study of the key issues in using LLMs to generate effective fuzz drivers. Our study features a curated dataset of 86 fuzz driver generation questions from 30 widely-used C projects. We designed six prompting strategies and tested them across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, at a cost of 0.85 billion tokens (over $8,000 in charged tokens). Additionally, we compared the LLM-generated drivers against those used in industry by conducting extensive fuzzing experiments (3.75 CPU-years). Our study uncovered that:

- While LLM-based fuzz driver generation is a promising direction, it still faces several obstacles to practical application;
- LLMs have difficulty generating effective fuzz drivers for APIs with intricate specifics. Three design choices in prompting strategies can be beneficial: issuing repeated queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes on par with those used in industry, there are substantial opportunities for enhancement, such as extending the API usage they contain or integrating semantic oracles to facilitate the detection of logical bugs.

Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.
