Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local, open-weight setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (the fraction of code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.
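As an illustration of the pairwise accuracy metric described in the abstract, the following is a minimal sketch and not the authors' evaluation code. The function name `pairwise_accuracy`, the boolean encoding (`True` means the model flagged the function as vulnerable), and the example predictions are all assumptions made up for this sketch.

```python
from typing import Sequence


def pairwise_accuracy(
    vuln_preds: Sequence[bool], patched_preds: Sequence[bool]
) -> float:
    """Fraction of code pairs where both functions are classified correctly.

    Hypothetical encoding: True means "predicted vulnerable".
    A pair counts as correct only if the vulnerable function is flagged
    (True) AND its patched counterpart is not flagged (False).
    """
    if len(vuln_preds) != len(patched_preds):
        raise ValueError("each vulnerable function needs a patched counterpart")
    if len(vuln_preds) == 0:
        raise ValueError("at least one code pair is required")

    correct_pairs = sum(
        1 for v, p in zip(vuln_preds, patched_preds) if v and not p
    )
    return correct_pairs / len(vuln_preds)


# Made-up predictions for ten code pairs (illustrative only, not study data).
vuln = [True, True, False, True, True, False, True, True, True, False]
patched = [False, True, False, True, False, True, True, False, True, True]

print(pairwise_accuracy(vuln, patched))
```

Under this definition, a model that labels every function as vulnerable scores 0.0, because it never classifies a patched function correctly. The metric is therefore stricter than per-function accuracy.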