Discovering 100+ Compiler Defects in 72 Hours via LLM-Driven Semantic Logic Recomposition

Compilers constitute the foundational root-of-trust in software supply chains; however, their immense complexity inevitably conceals critical defects. Recent research has attempted to leverage historical bugs to design new mutation operators or to fine-tune models that increase program diversity for compiler fuzzing. We observe, however, that whether a bug manifests depends primarily on the semantics of the input program rather than its syntax. Unfortunately, current approaches, whether they rely on syntactic mutation or on general Large Language Model (LLM) fine-tuning, struggle to preserve the specific semantics found in the logic of bug-triggering programs. Consequently, these critical semantic triggers are often lost, which limits the diversity of the generated programs. To explicitly reuse such semantics, we propose FeatureFuzz, a compiler fuzzer that combines features to generate programs. We define a feature as a decoupled primitive that encapsulates a natural language description of a bug-prone invariant, such as an out-of-bounds array access, alongside a concrete code witness of its realization. FeatureFuzz operates via a three-stage workflow: it first extracts features from historical bug reports, then synthesizes coherent groups of features, and finally instantiates these groups into valid programs for compiler fuzzing. We evaluated FeatureFuzz on GCC and LLVM. Over 24-hour campaigns, FeatureFuzz uncovered 167 unique crashes, which is 2.78x more than the second-best fuzzer. Furthermore, through a 72-hour fuzzing campaign, FeatureFuzz identified 113 bugs in GCC and LLVM, 97 of which have already been confirmed by compiler developers, validating the approach's ability to stress-test modern compilers effectively.
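
To make the feature definition and the three-stage workflow concrete, the following is a minimal sketch and not FeatureFuzz's actual implementation. The `Feature` class, the three function names, the report format, and the two sample features are all hypothetical, introduced only for illustration. The LLM-driven steps are replaced by trivial stand-ins: extraction just copies fields from pre-structured reports, synthesis enumerates every pair instead of selecting coherent groups, and instantiation simply concatenates the code witnesses.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical container for one feature as defined in the abstract."""
    description: str  # natural language description of a bug-prone invariant
    witness: str      # concrete code snippet that realizes the invariant


def extract_features(bug_reports: Iterable[dict]) -> List[Feature]:
    """Stage 1 stand-in: assumes each report already has both fields.
    A real system would derive them from raw bug reports."""
    return [Feature(r["description"], r["witness"]) for r in bug_reports]


def synthesize_groups(features: List[Feature], size: int = 2) -> List[Tuple[Feature, ...]]:
    """Stage 2 stand-in: enumerates all groups of the given size.
    A real system would keep only semantically coherent groups."""
    return list(combinations(features, size))


def instantiate(group: Tuple[Feature, ...]) -> str:
    """Stage 3 stand-in: concatenates witnesses into one C program.
    A real system would weave them into a single valid program."""
    body = "\n".join(
        f"    /* {f.description} */\n    {f.witness}" for f in group
    )
    return f"int main(void) {{\n{body}\n    return 0;\n}}\n"


# Made-up sample reports; not taken from any real bug tracker.
reports = [
    {"description": "out-of-bounds array access",
     "witness": "int a[4]; a[4] = 1;"},
    {"description": "signed overflow in a loop counter",
     "witness": "for (int i = 2147483640; i > 0; i++) {}"},
]

features = extract_features(reports)
groups = synthesize_groups(features)
programs = [instantiate(g) for g in groups]
```

Each string in `programs` would then be handed to the compiler under test. Keeping the description and the witness as separate fields mirrors the "decoupled primitive" idea: the description can guide how features are grouped, while the witness anchors the generated code to semantics that triggered a bug before.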
