LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning

Large language models (LLMs) have demonstrated significant potential for many downstream tasks, including those requiring human-level intelligence, such as vulnerability detection. However, recent attempts to use LLMs for vulnerability detection are still preliminary, as they lack an in-depth understanding of a subject LLM's vulnerability reasoning capability: whether it originates from the model itself or from external assistance, such as invoking tool support and retrieving vulnerability knowledge. In this paper, we aim to decouple LLMs' vulnerability reasoning capability from their other capabilities, including the ability to actively seek additional information (e.g., via function calling in SOTA models), adopt relevant vulnerability knowledge (e.g., via vector-based matching and retrieval), and follow instructions to output structured results. To this end, we propose a unified evaluation framework named LLM4Vuln, which separates LLMs' vulnerability reasoning from their other capabilities and evaluates how LLMs' vulnerability reasoning could be enhanced when combined with the enhancement of other capabilities. To demonstrate the effectiveness of LLM4Vuln, we have designed controlled experiments using 75 ground-truth smart contract vulnerabilities, which were extensively audited as high-risk on Code4rena from August to November 2023, and tested them in 4,950 different scenarios across three representative LLMs (GPT-4, Mixtral, and Code Llama). Our results not only reveal ten findings regarding the varying effects of knowledge enhancement, context supplementation, prompt schemes, and models but also enable us to identify 9 zero-day vulnerabilities in two pilot bug bounty programs with over 1,000 USD being awarded.
