White-box Compiler Fuzzing Empowered by Large Language Models

Compiler correctness is crucial, because miscompilation, which silently alters program behavior, can lead to serious consequences. Fuzzing has been extensively studied as a way to uncover compiler defects. However, compiler fuzzing remains challenging: existing work focuses on black- and grey-box fuzzing, which generates tests without sufficient understanding of internal compiler behavior. As a result, such fuzzers often fail to construct programs that satisfy the conditions of intricate optimizations. Meanwhile, traditional white-box techniques are computationally infeasible on the giant codebases of compilers. Recent advances demonstrate that Large Language Models (LLMs) excel at code generation and understanding tasks and have achieved state-of-the-art performance in black-box fuzzing. Nonetheless, prompting LLMs with compiler source-code information remains unexplored in compiler testing. To this end, we propose WhiteFox, the first white-box compiler fuzzer that uses LLMs with source-code information to test compiler optimizations. WhiteFox adopts a dual-model framework: (i) an analysis LLM examines the low-level optimization source code and produces requirements that high-level test programs must meet to trigger the optimization; (ii) a generation LLM produces test programs based on the summarized requirements. Additionally, tests that successfully trigger an optimization are used as feedback to further enhance test generation on the fly. Our evaluation on four popular compilers shows that WhiteFox can generate high-quality tests that exercise deep optimizations requiring intricate conditions, exercising up to 80 more optimizations than state-of-the-art fuzzers. To date, WhiteFox has found 96 bugs in total, of which 80 have been confirmed as previously unknown and 51 have already been fixed. Beyond compiler testing, WhiteFox can also be adapted for white-box fuzzing of other complex, real-world software systems.
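
The dual-model loop described above can be sketched roughly as follows. This is a minimal illustration, not WhiteFox's actual implementation: the names `Optimization`, `fuzz_optimization`, `analyze`, `generate`, and `triggers` are hypothetical, both LLMs are replaced by deterministic stand-in functions so the example runs offline, and the feedback step samples earlier triggering tests uniformly at random, which is an assumption (the abstract does not say how feedback tests are selected).

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Optimization:
    """One compiler optimization pass under test (hypothetical container)."""
    name: str
    source_code: str
    triggering_tests: List[str] = field(default_factory=list)


def fuzz_optimization(
    opt: Optimization,
    analyze: Callable[[str], str],               # stand-in for the analysis LLM
    generate: Callable[[str, List[str]], str],   # stand-in for the generation LLM
    triggers: Callable[[Optimization, str], bool],  # did the test exercise the pass?
    rounds: int = 3,
    tests_per_round: int = 4,
    max_examples: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Generate tests for one optimization, feeding triggering tests back in."""
    rng = rng or random.Random(0)
    # Step (i): summarize the optimization source into a requirement.
    requirement = analyze(opt.source_code)
    all_tests: List[str] = []
    for _ in range(rounds):
        # Feedback: reuse earlier triggering tests as few-shot examples.
        # Uniform sampling is an assumption made for this sketch.
        k = min(max_examples, len(opt.triggering_tests))
        examples = rng.sample(opt.triggering_tests, k)
        for _ in range(tests_per_round):
            # Step (ii): produce a test program from the requirement.
            test = generate(requirement, examples)
            all_tests.append(test)
            if triggers(opt, test) and test not in opt.triggering_tests:
                opt.triggering_tests.append(test)
    return all_tests


# Deterministic stand-ins so the sketch runs without any model access.
def fake_analyze(source_code: str) -> str:
    return "The program must multiply a variable by the constant 2."


def make_fake_generate() -> Callable[[str, List[str]], str]:
    counter = itertools.count()

    def fake_generate(requirement: str, examples: List[str]) -> str:
        n = next(counter)
        # Even-numbered programs satisfy the requirement, odd ones do not.
        return f"x = a * 2 + {n}" if n % 2 == 0 else f"x = a + {n}"

    return fake_generate


def fake_triggers(opt: Optimization, test: str) -> bool:
    # A real fuzzer would instrument the compiler; here we just match text.
    return "* 2" in test
```

In a real setting, `analyze` and `generate` would wrap calls to two language models, and `triggers` would compile the test with an instrumented compiler and report whether the target optimization actually fired.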
