In this paper, we present the first comprehensive empirical study of specialized LLM-based detectors and compare them with traditional static analyzers at the project scale. Specifically, our study evaluates five recent and representative LLM-based methods and two traditional tools using: 1) an in-house benchmark of 222 known real-world vulnerabilities (C/C++ and Java) to assess detection capability, and 2) 24 active open-source projects, in which we manually inspected 385 warnings to assess practical usability and the root causes of failures. Our evaluation yields three key findings. First, while LLM-based detectors exhibit low recall on the in-house benchmark, they still uncover more unique vulnerabilities than traditional tools. Second, on open-source projects, both LLM-based and traditional tools generate substantial numbers of warnings but suffer from very high false discovery rates, hindering practical use. Our manual analysis further reveals shallow interprocedural reasoning and misidentified source/sink pairs as the primary failure causes, with LLM-based tools exhibiting additional unique failure modes. Third, LLM-based methods incur substantial computational costs: hundreds of thousands to hundreds of millions of tokens and runtimes ranging from hours to days. Overall, our findings underscore critical limitations in the robustness, reliability, and scalability of current LLM-based detectors. We conclude by summarizing a set of implications for future research toward more effective and practical project-scale vulnerability detection.