Fuzz drivers are a necessary component of API fuzzing, but automatically generating correct and robust fuzz drivers is difficult. Compared to existing approaches, generation based on large language models (LLMs) is a promising direction: it places few requirements on consumer programs, can leverage multiple dimensions of API usage information, and produces human-friendly code. Nonetheless, the challenges and effectiveness of LLM-based fuzz driver generation remain unclear. To address this, we conducted a study on the effects, challenges, and techniques of LLM-based fuzz driver generation. We built a quiz of 86 fuzz driver generation questions drawn from 30 popular C projects, constructed precise effectiveness validation criteria for each question, and developed a framework for semi-automated evaluation. We designed five query strategies and evaluated 36,506 generated fuzz drivers. We also compared the generated drivers with manually written ones to obtain practical insights. Our evaluation revealed three findings. First, while overall performance was promising (91% of questions passed), filtering out ineffective fuzz drivers remains a practical challenge for large-scale application. Second, basic strategies achieved a decent correctness rate (53%) but struggled with questions involving complex API-specific usage; in such cases, example code snippets and iterative queries proved helpful. Third, while LLM-generated drivers achieved fuzzing outcomes comparable to those of manually written ones, there is still significant room for improvement, such as incorporating semantic oracles to detect logical bugs.
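For readers unfamiliar with the term, the following is a minimal sketch of what a fuzz driver for a C API can look like. It is an illustration only, not a driver from the study: `toy_count_fields` is a hypothetical stand-in for a real library function, and the assertion is a deliberately simple example of the kind of semantic oracle mentioned above. The entry point `LLVMFuzzerTestOneInput` is the standard libFuzzer interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical API under test: counts the ','-separated fields in a
 * NUL-terminated string. It stands in for a real library function. */
static size_t toy_count_fields(const char *s) {
    size_t n = 1;
    for (; *s != '\0'; ++s) {
        if (*s == ',') {
            ++n;
        }
    }
    return n;
}

/* libFuzzer entry point: called once for every generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    /* The API expects a C string, so copy the raw bytes and terminate them. */
    char *buf = malloc(size + 1);
    if (buf == NULL) {
        return 0;
    }
    if (size > 0) {
        memcpy(buf, data, size);
    }
    buf[size] = '\0';

    size_t fields = toy_count_fields(buf);

    /* Simple semantic oracle: `size` bytes can contain at most `size`
     * separators, so there can be at most `size + 1` fields. */
    assert(fields >= 1 && fields <= size + 1);

    free(buf);
    return 0;
}
```

Assuming the file is saved as `driver.c`, it can be built with `clang -g -fsanitize=fuzzer,address driver.c`. Even this small driver shows the API-specific details a generator has to get right: the input conversion (raw bytes to a terminated string), resource cleanup, and a check that goes beyond crash detection.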