Automatic, Expressive, and Scalable Fuzzing with Stitching

Fuzzing is a powerful technique for finding bugs in software libraries, but scaling it remains difficult. Automated harness generation commits to fixed API sequences at synthesis time, limiting the behaviors each harness can test. Approaches that instead explore new sequences dynamically lack the expressiveness to model real-world usage constraints, leading to false positives from straightforward API misuse. We propose stitching, a technique that encodes API usage constraints in blocks that a fuzzer dynamically assembles at runtime. A static type system governs how objects flow between blocks, while a dynamically checked extrinsic typestate tracks arbitrary metadata across blocks, enabling specifications to express rich semantic constraints such as object state dependencies and cross-function preconditions. This allows a single specification to describe an open-ended space of valid API interactions that the fuzzer explores guided by coverage feedback. We implement stitching in STITCH, using LLMs to automatically configure projects for fuzzing, synthesize a specification, triage crashes, and repair the specification itself. We evaluated STITCH against four state-of-the-art tools on 33 benchmarks, where it achieved the highest code coverage on 21 and found 30 true-positive bugs compared to 10 by all other tools combined, with substantially higher precision (70% vs. 12% for the next-best LLM-based tool). Deployed automatically on 1365 widely used open-source projects, STITCH discovered 131 new bugs across 102 projects, 73 of which have already been patched.
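
To make the idea concrete, the following is a minimal Python sketch of how typed blocks and a runtime-checked typestate might combine. It is an illustration under stated assumptions, not STITCH's actual specification language or implementation: the names `Obj`, `Block`, `BLOCKS`, and `stitch`, as well as the toy `Parser` API, are all hypothetical, and a uniform random choice stands in for coverage-guided selection.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Obj:
    type_name: str  # static type: decides which blocks may accept this object
    state: Dict[str, bool] = field(default_factory=dict)  # extrinsic typestate


@dataclass
class Block:
    name: str
    input_type: Optional[str]  # static type consumed, or None for a constructor
    pre: Callable[[Obj], bool]  # typestate precondition, checked at runtime
    run: Callable[[Optional[Obj]], Optional[Obj]]  # effect; may return a new object


def _close(parser: Obj) -> None:
    parser.state["open"] = False  # record that the parser may no longer be used


# Hypothetical specification for a toy Parser API (an assumption for illustration).
BLOCKS = [
    Block("create", None, lambda o: True, lambda _: Obj("Parser", {"open": True})),
    Block("feed", "Parser", lambda o: o.state["open"], lambda o: None),  # stand-in for a real call
    Block("close", "Parser", lambda o: o.state["open"], _close),
]


def stitch(blocks: List[Block], rng: random.Random, steps: int) -> List[Tuple[str, int]]:
    """Assemble a call sequence one block at a time; return (block name, object index) pairs."""
    pool: List[Obj] = []  # live objects produced so far
    trace: List[Tuple[str, int]] = []
    for _ in range(steps):
        candidates = []
        for block in blocks:
            if block.input_type is None:
                candidates.append((block, None))  # constructors need no input
                continue
            for index, obj in enumerate(pool):
                # Static check (type matches) plus dynamic check (typestate allows it).
                if obj.type_name == block.input_type and block.pre(obj):
                    candidates.append((block, index))
        if not candidates:
            break
        block, index = rng.choice(candidates)  # a real fuzzer would use coverage feedback
        produced = block.run(None if index is None else pool[index])
        if produced is not None:
            pool.append(produced)
            index = len(pool) - 1
        trace.append((block.name, index))
    return trace
```

Calling `stitch(BLOCKS, random.Random(0), 40)` yields a sequence in which no parser is fed or closed after it has been closed, which is the kind of straightforward API misuse the abstract identifies as a source of false positives. The sequence itself is not fixed in advance: the same three blocks describe an open-ended set of valid interleavings across any number of parsers.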
