Agentic Fuzzing: Opportunities and Challenges

Fuzzers and static analyzers find many bugs but struggle with logic bugs in mature codebases. Triggering such a bug often requires multi-step reasoning and produces no distinctive execution feedback, and its variants can appear in implementations too different for a single pattern to match. Recent LLM-assisted approaches help, but they use LLMs as auxiliaries rather than as the reasoning engine. We propose agentic fuzzing, a bug-finding approach seeded by historical bugs in which deep agents perform the reasoning directly. Given a reference bug, the agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof-of-concept code. This lets the agent find variants whose trigger path or code structure differs completely from that of the reference. We identify three practical challenges in implementing agentic fuzzing: harness engineering, redundant investigations across seeds with similar root causes, and scheduling seeds in a large corpus. AFuzz addresses these with a four-stage agent pipeline, a scenario-coverage mechanism that deduplicates previously explored scenarios, and a DPP-MAP scheduler that orders seeds by diversity. We ran AFuzz on the V8 JavaScript engine for about one month. It found 40 bugs (including three duplicates), which earned a total of $35,000 in bounties and two CVE assignments. Using the seeds from V8, AFuzz also found 19 bugs (including one duplicate) in SpiderMonkey and JavaScriptCore. Agentic fuzzing is still in its early stages, and several open problems remain, which we discuss in the paper. Even so, we believe it points to a promising direction for finding logic bugs.
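
To make the idea of diversity-based seed ordering concrete, the following is a minimal sketch of greedy MAP inference for a determinantal point process (DPP), assuming that is what "DPP-MAP" refers to. The abstract does not specify AFuzz's kernel, features, or inference procedure, so everything here is illustrative: the function names `gram_kernel` and `greedy_dpp_map`, the toy feature vectors, and the choice of a plain Gram kernel are all assumptions, not the authors' implementation.

```python
import math

def gram_kernel(features):
    """Build a PSD kernel L[i][j] = <f_i, f_j> from per-seed feature vectors.

    Assumption: each seed is summarized by a hypothetical root-cause feature
    vector; longer vectors act as higher-priority seeds.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def greedy_dpp_map(kernel, k, eps=1e-10):
    """Greedily pick up to k indices, each maximizing the determinant gain."""
    n = len(kernel)
    gain = [kernel[i][i] for i in range(n)]   # marginal gain of each seed
    chol = [[] for _ in range(n)]             # incremental Cholesky rows
    remaining = list(range(n))
    order = []
    while remaining and len(order) < k:
        j = max(remaining, key=lambda i: gain[i])
        if gain[j] <= eps:                    # nothing new left to cover
            break
        order.append(j)
        remaining.remove(j)
        root = math.sqrt(gain[j])
        for i in remaining:
            # Remove the part of seed i already explained by selected seeds.
            proj = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel[j][i] - proj) / root
            chol[i].append(e)
            gain[i] -= e * e
    return order

# Hypothetical seeds: seed 1 has nearly the same root cause as seed 0.
seeds = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.7],
]
schedule = greedy_dpp_map(gram_kernel(seeds), k=3)
print(schedule)  # prints [0, 2, 3]
```

In this toy run, seed 1 is passed over because its features are already covered by seeds 0 and 2, which is the behavior one would want from a scheduler that avoids spending agent time on seeds with similar root causes.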
