LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation

Compiler bugs pose a significant threat to safety-critical applications, so isolating them promptly and effectively is crucial for assuring compiler quality. However, the debugging information available for reported bugs is limited, which complicates the compiler bug isolation task. Existing compiler bug isolation approaches typically convert the problem into a test program mutation problem, but they remain limited by ineffective mutation strategies or high human effort. Drawing inspiration from the recent progress of pre-trained Large Language Models (LLMs), such as ChatGPT, in code generation, we propose a new approach named LLM4CBI that tames LLMs to generate effective test programs for compiler bug isolation. Using LLMs directly for test program mutation may not yield the desired results, however, because it is difficult both to formulate precise prompts and to select specialized prompts. To overcome these challenges, LLM4CBI introduces three new components. (1) A program complexity-guided prompt production component leverages data and control flow analysis to identify the most valuable variables and locations in programs for mutation. (2) A memorized prompt selection component adopts reinforcement learning to continuously select specialized prompts for mutating test programs. (3) A test program validation component selects specialized feedback prompts so that the same mistakes are not repeated during the mutation process. Compared with the state-of-the-art approaches (DiWi and RecBi), our evaluation demonstrates the advantages of LLM4CBI: it isolates 13.6% to 90.9% more bugs than the other approaches across various settings. Additionally, we demonstrate that LLM4CBI is extensible, allowing for easy integration with other LLMs.
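
To make the idea behind component (2) more concrete, the following is a minimal sketch of reward-driven prompt selection. The abstract states only that reinforcement learning is used; the epsilon-greedy bandit below is a simple stand-in chosen for illustration, not the algorithm of LLM4CBI. The class name `PromptSelector`, the prompt strings, and the reward values are all hypothetical.

```python
import random


class PromptSelector:
    """Epsilon-greedy selection over a fixed set of mutation prompts.

    Illustrative stand-in only: the abstract does not specify which
    reinforcement learning algorithm LLM4CBI uses.
    """

    def __init__(self, prompts, epsilon=0.1, seed=0):
        self.prompts = list(prompts)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # "Memory" of past outcomes: use count and mean reward per prompt.
        self.counts = {p: 0 for p in self.prompts}
        self.values = {p: 0.0 for p in self.prompts}

    def select(self):
        # Explore a random prompt with probability epsilon ...
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.prompts)
        # ... otherwise exploit the prompt with the best mean reward so far.
        return max(self.prompts, key=lambda p: self.values[p])

    def update(self, prompt, reward):
        # Incremental update of the running mean reward for this prompt.
        self.counts[prompt] += 1
        n = self.counts[prompt]
        self.values[prompt] += (reward - self.values[prompt]) / n


if __name__ == "__main__":
    # Hypothetical prompt templates; the real prompts are not given in the abstract.
    selector = PromptSelector(
        ["insert an if statement", "change a loop bound", "add a qualifier"],
        epsilon=0.2,
    )
    for _ in range(20):
        prompt = selector.select()
        # Placeholder reward: a real system would score the mutated test program.
        reward = 1.0 if prompt == "change a loop bound" else 0.0
        selector.update(prompt, reward)
    print(selector.values)
```

In such a loop, `select()` would pick the prompt sent to the LLM and `update()` would record how useful the resulting mutated test program turned out to be, so that prompts which worked well earlier are preferred later. How usefulness is scored is left as a placeholder here.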
