Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date of whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and use our framework to analyze eight of the most capable LLMs across eight distinct investigative dimensions. Our evaluation shows that LLMs provide non-deterministic responses, produce incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models, such as PaLM2 and GPT-4: merely changing function or variable names, or adding library functions to the source code, can cause these models to yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further advances are needed before LLMs can be used as general-purpose security assistants.
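To illustrate the kind of trivial, semantics-preserving perturbation described above, the following sketch shows an identifier rename applied to a vulnerable C function. The scenario, names, and weakness class here are illustrative assumptions, not examples drawn from the SecLLMHolmes benchmark; the point is only that both versions contain the identical bug, so a robust detector's verdict should not change between them.

```c
/* Illustrative only: a minimal out-of-bounds write (CWE-787 style).
 * Both functions below are equally vulnerable; only identifier names
 * differ. Robustness tests of the kind described above check whether
 * a model's vulnerable/not-vulnerable verdict survives such renames. */
#include <string.h>

/* Original naming: identifiers hint at the bug's context. */
void copy_user_input(char *user_input) {
    char buffer[16];
    strcpy(buffer, user_input);  /* no bounds check: overflows for inputs longer than 15 chars */
}

/* Renamed variant: identical semantics, neutral identifiers. */
void process(char *p) {
    char b[16];
    strcpy(b, p);  /* the same out-of-bounds write as above */
}
```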