Discovering 100+ Compiler Defects in 72 Hours via LLM-Driven Semantic Logic Recomposition

Compilers constitute the foundational root of trust in software supply chains; however, their immense complexity inevitably conceals critical defects. Recent research has leveraged historical bugs to design new mutation operators or to fine-tune models, aiming to increase program diversity for compiler fuzzing. We observe, however, that bugs are triggered primarily by the semantics of input programs rather than their syntax. Current approaches, whether they rely on syntactic mutation or on general Large Language Model (LLM) fine-tuning, struggle to preserve the specific semantics embedded in the logic of bug-triggering programs. Consequently, these critical semantic triggers are often lost, which limits the diversity of the generated programs. To explicitly reuse such semantics, we propose FeatureFuzz, a compiler fuzzer that combines features to generate programs. We define a feature as a decoupled primitive that encapsulates a natural-language description of a bug-prone invariant, such as an out-of-bounds array access, alongside a concrete code witness of its realization. FeatureFuzz operates via a three-stage workflow: it first extracts features from historical bug reports, then synthesizes coherent groups of features, and finally instantiates these groups into valid programs for compiler fuzzing. We evaluated FeatureFuzz on GCC and LLVM. Over 24-hour campaigns, FeatureFuzz uncovered 167 unique crashes, which is 2.78x more than the second-best fuzzer. Furthermore, in a 72-hour fuzzing campaign, FeatureFuzz identified 106 bugs in GCC and LLVM, 76 of which have already been confirmed by compiler developers, validating the approach's ability to stress-test modern compilers effectively.
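To make the notion of a feature and the three-stage workflow more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the names `Feature`, `feature_groups`, `build_prompt`, and `EXAMPLE_FEATURES` are hypothetical, the example features other than the out-of-bounds access are invented for illustration, and exhaustive enumeration stands in for whatever coherence-aware synthesis FeatureFuzz actually performs. The extraction stage (mining features from bug reports) and the LLM call itself are omitted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Feature:
    """A decoupled primitive: a natural-language description of a
    bug-prone invariant plus a concrete code witness (hypothetical layout)."""
    description: str
    witness: str


# Illustrative features only; a real corpus would be extracted from
# historical bug reports (stage 1, not shown here).
EXAMPLE_FEATURES: List[Feature] = [
    Feature("out-of-bounds array access",
            "int a[4]; a[i] = 0; /* i may equal 4 */"),
    Feature("signed overflow in a loop counter",
            "for (int k = 1; k > 0; k *= 2) { }"),
    Feature("pointer aliasing across a call",
            "void f(int *p, int *q) { *p = 1; *q = 2; }"),
]


def feature_groups(features: List[Feature],
                   size: int) -> Iterator[Tuple[Feature, ...]]:
    """Stage 2 stand-in: enumerate every group of `size` features.
    Assumption: the real system selects coherent groups rather than all."""
    return combinations(features, size)


def build_prompt(group: Tuple[Feature, ...]) -> str:
    """Stage 3 stand-in: build the text that would ask a model to
    instantiate the group as a single valid program."""
    parts = ["Write one valid C program that combines these features:"]
    for index, feature in enumerate(group, start=1):
        parts.append(f"{index}. {feature.description}\n   Witness:\n   {feature.witness}")
    return "\n".join(parts)


if __name__ == "__main__":
    for group in feature_groups(EXAMPLE_FEATURES, 2):
        print(build_prompt(group))
        print("-" * 40)
```

Keeping the description and the witness together in one record reflects the definition above: the description carries the semantics to be reused, while the witness shows one concrete realization that a generator can adapt when combining features.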
