Large Language Models (LLMs) have made notable progress across multiple areas of software engineering, including software vulnerability detection. Using LLMs to automate vulnerability detection in the wild is an important and relatively under-explored problem. In this paper, we propose QuiLL, the first comprehensive evaluation framework for real-world vulnerability detection. QuiLL is an end-to-end pipeline that combines cutting-edge LLM optimization techniques with strategies tailored to the complexities of real-world vulnerability detection. Our specific contributions are (i) diverse prompt designs for vulnerability detection and reasoning, (ii) a vector data store of real-world vulnerabilities, constructed from the National Vulnerability Database, that provides dynamic in-context learning, and (iii) a novel scoring metric that quantifies both the accuracy and the reasoning quality of model predictions. QuiLL enables researchers to systematically benchmark and compare the vulnerability detection capabilities of various LLMs and to assess their readiness for deployment in production code pipelines.