Software is prone to security vulnerabilities. Program analysis tools that detect them have limited effectiveness in practice due to their reliance on human-labeled specifications. Large language models (LLMs) have shown impressive code generation capabilities, but they cannot perform the complex reasoning over code needed to detect such vulnerabilities, especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, alleviating the need for human specifications and inspection. For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool, CodeQL, detects only 27 of these vulnerabilities, whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 6 previously unknown vulnerabilities that cannot be found by existing tools.