Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles

Ensuring the correctness of compiler optimizations is critical, but existing fuzzers struggle to test optimizations effectively. First, most fuzzers use optimization pipelines (heuristics-based, fixed sequences of passes) as their harness. Because phase ordering can enable or preempt transformations, pipelines inevitably miss optimization interactions; moreover, many optimizations are not scheduled at all, even at aggressive optimization levels. Second, optimizations typically fire only when inputs satisfy specific structural relationships, which existing generators and mutators struggle to produce. We propose targeted fuzzing of individual optimizations to complement pipeline-based testing. Our key idea is to exploit composition styles, that is, structural relations over program constructs (adjacency, nesting, repetition, ordering) that optimizations look for. We build a general-purpose, grammar-based mutational fuzzer, TargetFuzz, that (i) mines composition styles from an optimization-relevant corpus and then (ii) rebuilds them, via synthesized mutations, inside the different contexts offered by a larger, generic corpus in order to test variations of the optimization logic. TargetFuzz adapts to a new programming language through lightweight, grammar-based construct annotations, and it automatically synthesizes the mutators and crossovers that rebuild composition styles. It therefore needs no hand-coded generators or language-specific mutators, which is particularly useful for modular frameworks such as MLIR, whose dialect-based, rapidly evolving ecosystem makes optimizations difficult to fuzz. Our evaluation on LLVM and MLIR shows that, compared to baseline fuzzers under the targeted fuzzing mode, TargetFuzz improves coverage by 8% and 11% and triggers optimizations 2.8$\times$ and 2.6$\times$ more often, respectively. We also show that targeted fuzzing is complementary to pipeline-based fuzzing: it effectively tests all 37 sampled LLVM optimizations, whereas pipeline-based fuzzing missed 12.
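
To make the notion of a composition style concrete, the following toy sketch in Python mines two of the relations named above (nesting and adjacency) from a small construct tree and then rebuilds one of them inside a different context. It is an illustration under stated assumptions, not TargetFuzz's implementation: the `Node` type, the `walk`, `mine_styles`, and `rebuild` helpers, and the construct labels (`func`, `loop`, `load`, `store`, and so on) are all hypothetical, and a real tool would operate on grammar-derived parse trees rather than hand-built ones.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical construct tree node; `kind` stands in for a grammar annotation."""
    kind: str
    children: list[Node] = field(default_factory=list)


def walk(root):
    """Yield every node of the tree in pre-order."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def mine_styles(root):
    """Collect (relation, kind_a, kind_b) triples observed in one tree."""
    styles = set()
    for node in walk(root):
        # Nesting: a construct of kind_b appears directly inside kind_a.
        for child in node.children:
            styles.add(("nesting", node.kind, child.kind))
        # Adjacency: kind_b immediately follows kind_a among siblings.
        for left, right in zip(node.children, node.children[1:]):
            styles.add(("adjacency", left.kind, right.kind))
    return styles


def rebuild(style, context):
    """Return a mutated copy of `context` in which `style` holds, or None
    if the context offers no construct of the required first kind."""
    relation, first, second = style
    mutant = copy.deepcopy(context)  # never modify the corpus entry itself
    for node in walk(mutant):
        if relation == "nesting" and node.kind == first:
            node.children.append(Node(second))
            return mutant
        if relation == "adjacency":
            for i, child in enumerate(node.children):
                if child.kind == first:
                    node.children.insert(i + 1, Node(second))
                    return mutant
    return None


# An "optimization-relevant" seed: a load directly followed by a store in a loop.
seed = Node("func", [Node("loop", [Node("load"), Node("store")])])
# A "generic" context that lacks the load/store adjacency.
context = Node("func", [Node("branch", [Node("load"), Node("call")])])

target = ("adjacency", "load", "store")
mutant = rebuild(target, context)
```

Here `mine_styles(seed)` contains the load/store adjacency, `mine_styles(context)` does not, and `mutant` is a copy of `context` in which that adjacency has been re-created inside the `branch` construct. The sketch inserts a bare node of the second kind; a realistic mutator would instead splice in a well-formed fragment and repair any references it breaks.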
