The widespread adoption of open-source software (OSS) has accelerated software innovation but also increased security risks, owing to the rapid propagation of vulnerabilities and the silent release of security patches. In recent years, large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities across software engineering (SE) tasks, enabling them to effectively address software security challenges such as vulnerability detection. However, systematic evaluation of LLMs and LLM-based agents for security patch detection remains limited. To bridge this gap, we conduct a comprehensive evaluation of LLMs and LLM-based agents on security patch detection. Specifically, we investigate three methods: the Plain LLM (a single LLM guided by a system prompt), the Data-Aug LLM (the Plain LLM combined with data augmentation), and the ReAct Agent (an agent built on the thought-action-observation mechanism). We evaluate both commercial and open-source LLMs under these methods and compare the results against existing baselines. Furthermore, we analyze detection performance across different vulnerability types and examine how prompting strategies and context window sizes affect the results. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent attains the lowest false positive rate (FPR). Although the baseline methods exhibit strong accuracy, their false positive rates are markedly higher; in contrast, our evaluated methods achieve comparable accuracy while substantially reducing the FPR. These findings provide practical insights into applying LLMs and LLM-based agents to security patch detection, highlighting their ability to maintain robust performance while keeping false positives low.
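For readers unfamiliar with the ReAct paradigm mentioned above, the following is a minimal illustrative sketch (not the paper's implementation) of a thought-action-observation loop applied to security patch detection. The `call_llm` helper, the tool names, and the commit fields are hypothetical placeholders standing in for whatever chat-completion API and repository tooling an actual system would use.

```python
# Minimal ReAct-style sketch for deciding whether a commit is a security patch.
# All names below (call_llm, view_diff, view_message, finish) are illustrative
# assumptions, not the evaluated system's actual interface.

SYSTEM_PROMPT = (
    "You are a security analyst. Decide whether the given commit is a "
    "security patch. Reason in Thought/Action/Observation steps. "
    "Available actions: view_diff, view_message, finish(yes) or finish(no)."
)


def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion client; replace with a real one."""
    raise NotImplementedError


def run_tool(action: str, commit: dict) -> str:
    """Execute the tool named in the model's Action step."""
    if action == "view_diff":
        return commit["diff"]
    if action == "view_message":
        return commit["message"]
    return f"Unknown action: {action}"


def react_detect(commit: dict, max_steps: int = 5) -> str:
    """Iterate Thought -> Action -> Observation until the model emits a verdict."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Commit id: {commit['id']}"},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)  # model produces a Thought and an Action
        messages.append({"role": "assistant", "content": reply})
        if "finish(" in reply:  # model has committed to a final verdict
            return "security" if "finish(yes" in reply.lower() else "non-security"
        tail = reply.rsplit("Action:", 1)[-1].strip()
        action = tail.split()[0] if tail else ""
        observation = run_tool(action, commit)
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "non-security"  # conservative default when no verdict is reached
```

The explicit tool calls and the conservative fallback illustrate, under these assumptions, why an agentic loop of this kind can trade a little accuracy for a lower false positive rate compared with a single-shot prompt.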