Understanding Large Language Model Based Fuzz Driver Generation

Fuzz drivers are necessary for library API fuzzing. Generating them automatically is challenging because it requires high-quality API usage code that is both correct and robust. Fuzz driver generation based on large language models (LLMs) is a promising direction. Compared to traditional generators based on program analysis, it is a text-based approach that is more lightweight and general. It can easily leverage various sources of API usage information and produces human-friendly code. Nonetheless, a basic understanding of this direction is still lacking. To fill this gap, we conducted a study targeting the core issues of using LLMs for effective fuzz driver generation. For a systematic understanding, we designed and analyzed 5 query strategies, ranging from basic to enhanced. For evaluation at scale, we built a semi-automatic framework containing a quiz of 86 driver generation questions collected from 30 popular C projects and a set of criteria for precisely validating driver effectiveness. In total, 189,628 fuzz drivers, using 0.22 billion tokens, were generated and evaluated. In addition, the generated drivers were compared with those used in industry to obtain practical insights (3.12 CPU-years of fuzzing experiments). Our study revealed that: 1) LLM-based generation shows promising practicality. 64% of the questions can be solved entirely automatically, and the number rises to 91% if manual semantic validators are incorporated. Moreover, the generated drivers exhibited performance competitive with that of drivers commonly employed in industry; 2) LLMs struggle to generate fuzz drivers that require complex API usage specifics. Three key designs can help: repeated querying, querying with examples, and iterative querying. Combining them yields a dominant strategy; 3) Significant room for improvement remains, for example in automatic semantic correctness validation, API usage expansion, and semantic oracle generation.
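
The three key designs named above (repeated querying, querying with examples, and iterative querying) can be combined roughly as in the following minimal sketch. It is an illustration under stated assumptions, not the framework used in the study: the prompt wording, the names `build_prompt`, `generate_driver`, and `Verdict`, the retry limits, and the stub model and validator are all hypothetical. The only real-world name is `LLVMFuzzerTestOneInput`, the libFuzzer entry point.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Result of validating one candidate driver (hypothetical type)."""
    ok: bool
    error: str = ""


def build_prompt(api: str, example: Optional[str],
                 prev: Optional[str], error: str) -> str:
    """Assemble a query; the wording is an assumption, not the paper's."""
    parts = [f"Write a libFuzzer driver in C that fuzzes `{api}`."]
    if example:  # querying with examples
        parts.append(f"Example usage of the API:\n{example}")
    if prev is not None:  # iterative querying: feed the error back
        parts.append(
            f"The previous driver failed validation:\n{prev}\n"
            f"Error:\n{error}\nPlease fix it."
        )
    return "\n\n".join(parts)


def generate_driver(api: str,
                    query_llm: Callable[[str], str],
                    validate: Callable[[str], Verdict],
                    example: Optional[str] = None,
                    max_repeats: int = 3,
                    max_fix_rounds: int = 2) -> Optional[str]:
    """Return the first driver that passes validation, or None."""
    for _ in range(max_repeats):  # repeated querying: fresh attempts
        driver, error = None, ""
        for _ in range(max_fix_rounds + 1):  # initial query plus fix rounds
            driver = query_llm(build_prompt(api, example, driver, error))
            verdict = validate(driver)
            if verdict.ok:
                return driver
            error = verdict.error
    return None


# Stub model and validator so the sketch runs without any external service.
def fake_llm(prompt: str) -> str:
    if "failed validation" in prompt:
        return ("int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)"
                " { return 0; }")
    return "int main(void) { return 0; }"


def fake_validate(driver: str) -> Verdict:
    if "LLVMFuzzerTestOneInput" in driver:
        return Verdict(True)
    return Verdict(False, "missing LLVMFuzzerTestOneInput entry point")


driver = generate_driver("parse_config", fake_llm, fake_validate,
                         example="parse_config(buf, len);")
```

In this toy run the first answer fails validation and the follow-up query, which carries the error message, yields a driver that passes. A real validator would presumably compile the driver and run it briefly under a fuzzer, which is where the study's effectiveness criteria would apply.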
