Large language models (LLMs) have demonstrated significant potential in various tasks, including vulnerability detection. However, current efforts in this area are preliminary, and it remains unclear whether LLMs' vulnerability reasoning capabilities stem from the models themselves or from external aids such as knowledge retrieval and tooling support. This paper aims to isolate LLMs' vulnerability reasoning from their other capabilities, such as vulnerability knowledge adoption, context information retrieval, and structured output generation. We introduce LLM4Vuln, a unified evaluation framework that assesses LLMs' vulnerability reasoning in isolation and examines how it improves when combined with these other enhancements. We conducted controlled experiments with 97 ground-truth vulnerabilities and 97 non-vulnerable cases in Solidity and Java, testing them in a total of 9,312 scenarios across four LLMs (GPT-4, GPT-3.5, Mixtral, and Llama 3). Our findings reveal the varying impacts of knowledge enhancement, context supplementation, prompt schemes, and model choice. Additionally, we identified 14 zero-day vulnerabilities across four pilot bug bounty programs, earning \$3,576 in bounties.
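To make the scale of the controlled experiments concrete, the sketch below enumerates one possible factorization of the 9,312 scenarios (194 cases tested under 48 model/setting combinations). The factor names and levels (three knowledge-enhancement settings, context supplementation on/off, two prompt schemes) are illustrative assumptions for the arithmetic, not the paper's exact configuration.

```python
from itertools import product

# Hypothetical scenario matrix: factor names and levels below are
# illustrative assumptions, not the paper's exact experimental setup.
MODELS = ["gpt-4", "gpt-3.5", "mixtral", "llama-3"]  # the four evaluated LLMs
KNOWLEDGE = ["none", "raw", "summarized"]            # assumed knowledge-enhancement settings
CONTEXT = [False, True]                              # with/without context supplementation
PROMPT_SCHEME = ["direct", "cot"]                    # assumed prompt schemes
CASES = 97 + 97                                      # ground-truth vulnerable + non-vulnerable

# Enumerate every combination of model and enhancement setting.
scenarios = [
    {"model": m, "knowledge": k, "context": c, "prompt": p}
    for m, k, c, p in product(MODELS, KNOWLEDGE, CONTEXT, PROMPT_SCHEME)
]

# 4 models x 3 knowledge x 2 context x 2 prompts = 48 settings;
# 48 settings x 194 cases = 9,312 scenarios in total.
print(len(scenarios) * CASES)  # -> 9312
```

Under this assumed decomposition, each of the 194 cases is run once per combination, which is how a modest case set yields a 9,312-scenario evaluation.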