Software is prone to security vulnerabilities, and program analysis tools for detecting them have limited effectiveness in practice. While large language models (LLMs) have shown impressive code generation capabilities, they cannot perform the complex reasoning over code required to detect such vulnerabilities, especially because this task requires whole-repository analysis. In this work, we propose IRIS, the first approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for detecting security vulnerabilities. We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. These projects are complex, averaging 300,000 lines of code, with the largest reaching 7 million. Of the 120 vulnerabilities in CWE-Bench-Java, IRIS detects 69 using GPT-4, while the state-of-the-art static analysis tool detects only 27. Furthermore, IRIS significantly reduces the number of false alarms (by more than 80% in the best case).