Variability-Aware Detection and Repair of Compilation Errors Using Foundation Models in Configurable Systems

Modern software systems often rely on conditional compilation to support optional features and multiple deployment scenarios. In configurable systems, compilation errors may arise only under specific combinations of features and therefore remain hidden during development and testing. Such variability-induced errors are difficult to detect in practice: traditional compilers analyze only a single configuration at a time, while existing variability-aware tools typically require complex setup and incur high analysis costs. In this article, we present an empirical study on the use of foundation models to detect and fix compilation errors caused by feature variability in configurable C systems. We evaluate GPT-OSS-20B and GEMINI 3 PRO, and compare them with TYPECHEF, a state-of-the-art variability-aware parser. Our evaluation considers two complementary settings: 5,000 small configurable systems designed to systematically exercise variability-induced compilation behavior, both with and without compilation errors, and 14 real-world GitHub commits, supplemented by 42 mutation testing scenarios. Our results show that foundation models can effectively identify variability-induced compilation errors. On the small configurable systems, GPT-OSS-20B achieved a precision of 0.97, a recall of 0.90, and an accuracy of 0.94, substantially increasing detection coverage compared to TYPECHEF and performing comparably to GEMINI 3 PRO. For compilation error repair, GPT-OSS-20B produced compilable fixes in over 70% of the cases. In the analysis of real commits, CHATGPT-5.2 detected all but two of the injected faults and identified a potential real compilation bug in a Linux commit with more than 1,000 modified lines. Our findings indicate that current state-of-the-art foundation models provide a practical and low-effort complement to traditional variability-aware analyses.
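As an illustration of the kind of error studied here, the following minimal sketch shows a C translation unit that compiles in some configurations and fails in others. It is not taken from the evaluated systems; the feature macros ENABLE_LOG and ENABLE_STATS, the helper macro DEMO_MAIN, and the function process are all hypothetical names chosen for this example.

```c
#include <stdio.h>

/* Hypothetical feature macros: ENABLE_LOG and ENABLE_STATS.
 * log_level exists only when ENABLE_LOG is defined. */
#ifdef ENABLE_LOG
static int log_level = 1;
#endif

int process(int x) {
#ifdef ENABLE_STATS
    /* Variability-induced error: this block uses log_level, which is
     * undeclared when ENABLE_STATS is defined but ENABLE_LOG is not. */
    if (log_level > 0) {
        printf("stats: %d\n", x);
    }
#endif
    return x * 2;
}

/* Hypothetical DEMO_MAIN macro: define it to build a standalone program. */
#ifdef DEMO_MAIN
int main(void) {
    printf("%d\n", process(21));
    return 0;
}
#endif
```

With neither feature defined, or with both defined, the file compiles. Defining only ENABLE_STATS (for example, passing -DENABLE_STATS to the compiler without -DENABLE_LOG) should produce an undeclared-identifier error for log_level. A compiler invoked on the default configuration never sees this error, which is why such faults can stay hidden.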
