Large Language Models (LLMs) have emerged as a promising approach to automated vulnerability detection. However, most prior studies have evaluated LLMs only on single functions in isolation, overlooking vulnerabilities that arise from data and control flows spanning multiple functions. Leveraging the context provided by callers and callees may therefore help identify such vulnerabilities. This study empirically investigates the detection effectiveness, inference cost, and explanation quality of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To that end, we evaluate the four models on 509 vulnerabilities from the ReposVul dataset across C, C++, and Python, systematically varying the level of interprocedural context (target function only, target function + callers, and target function + callees). The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and that Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that can generalize across codebases in multiple programming languages.
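
As an illustration of the experimental design summarized above, the following minimal Python sketch shows one way the three context levels could be assembled into model input and how an F1 score could be computed from binary predictions. It is an assumption-laden sketch, not the study's actual pipeline: the `Sample` class, the level names, the section headers in the assembled text, and both function names are hypothetical, and the abstract does not specify the prompts or the evaluation code that were used.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels for the three context levels named in the abstract.
CONTEXT_LEVELS = ("code_only", "with_callers", "with_callees")


@dataclass
class Sample:
    """Hypothetical container for one function and its interprocedural context."""
    target: str
    callers: List[str] = field(default_factory=list)
    callees: List[str] = field(default_factory=list)


def build_context(sample: Sample, level: str) -> str:
    """Assemble the code shown to the model for one context level."""
    if level not in CONTEXT_LEVELS:
        raise ValueError(f"unknown context level: {level}")
    parts = ["### Target function\n" + sample.target]
    if level == "with_callers":
        parts += ["### Caller\n" + code for code in sample.callers]
    elif level == "with_callees":
        parts += ["### Callee\n" + code for code in sample.callees]
    return "\n\n".join(parts)


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Standard binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # covers the undefined precision/recall cases
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling `build_context(sample, level)` once per entry in `CONTEXT_LEVELS` yields the three inputs for a given function, so the same sample can be scored under each configuration and the per-configuration F1 values compared.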