Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, namely Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks on vulnerabilities from the OWASP Benchmark and from real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On the real-world dataset, the best configuration achieves an FP identification rate of up to 93.3% on CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, which exposes a trade-off between noise reduction and detection coverage. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.
