In the context of rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, which lead to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve label accuracy comparable to human-verified benchmarks while significantly expanding the dataset. It also implements rigorous data de-duplication and chronological data splitting to mitigate data leakage, and introduces more realistic evaluation metrics and settings. Together, these measures aim to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate their performance. For instance, a state-of-the-art 7B model scores 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models such as GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
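To make the de-duplication and chronological-splitting ideas concrete, the sketch below shows one common way such a pipeline can be implemented. This is an illustration under assumed conventions, not PrimeVul's actual pipeline: the function `dedup_and_split` and the sample fields `code` and `commit_date` are hypothetical names chosen for the example. Exact duplicates are collapsed by hashing whitespace-normalized code, and the split places every training commit strictly before every test commit, so the model never trains on code newer than its evaluation data.

```python
import hashlib
from datetime import datetime

def dedup_and_split(samples, cutoff):
    """Illustrative sketch (not PrimeVul's exact method): drop exact
    duplicates by hashing normalized code, then split chronologically
    so all training commits predate all test commits."""
    seen, unique = set(), []
    for s in samples:
        # Normalize whitespace so trivially reformatted clones collapse
        # to the same hash key.
        key = hashlib.sha256(" ".join(s["code"].split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    train = [s for s in unique if s["commit_date"] < cutoff]
    test = [s for s in unique if s["commit_date"] >= cutoff]
    return train, test

# Hypothetical samples for illustration only.
samples = [
    {"code": "int f() { return 0; }", "commit_date": datetime(2019, 5, 1)},
    {"code": "int  f() { return 0; }", "commit_date": datetime(2021, 2, 1)},  # whitespace clone
    {"code": "int g() { return 1; }", "commit_date": datetime(2022, 8, 1)},
]
train, test = dedup_and_split(samples, cutoff=datetime(2021, 1, 1))
# The whitespace clone is dropped; one unique sample falls before the
# cutoff (train) and one after (test).
```

A random split would instead let near-identical functions, or future fixes of a training-set bug, leak into the test set, which is one way benchmarks come to overestimate model performance.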