Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Yikun Li, Ngoc Tan Bui, Ting Zhang, Chengran Yang, Xin Zhou, Martin Weyssow, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo

From arXiv. Accepted for publication at ICSE 2026.

Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Prior work found that current vulnerability datasets suffer from label inaccuracy rates of 20%-71%, extensive duplication, and poor coverage of critical Common Weakness Enumeration (CWE) types. These issues create a significant generalization gap: models achieve misleading In-Distribution (ID) accuracies (testing on splits from the same dataset) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three-part solution. First, we introduce BenchVul, a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct a high-quality training dataset, TitanVul, comprising 38,548 functions, by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM pipeline. Third, we propose a Realistic Vulnerability Generation (RVG) pipeline, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation reveals that ID performance does not reliably predict Out-of-Distribution (OOD) performance on BenchVul. For example, a model trained on BigVul achieves the highest ID accuracy (0.703) but fails on BenchVul's real-world samples (0.493 OOD accuracy). Conversely, a model trained on our TitanVul achieves the highest OOD performance on both the real-world (0.881) and synthesized (0.785) portions of BenchVul, improving upon the next-best performing dataset by 5.3% and 11.8% respectively, despite a modest ID score (0.590). Augmenting TitanVul with our RVG further boosts this leading OOD performance, improving accuracy on real-world data by 5.8% (to 0.932).
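
The reported figures can be checked with a little arithmetic. The sketch below is only an illustration and not the authors' evaluation code: it assumes the quoted percentage gains are relative improvements (an assumption, though it is consistent with the move from 0.881 to 0.932), and the helper names `relative_improvement` and `id_ood_gap` are hypothetical. The accuracy values are the ones stated in the abstract.

```python
# Illustrative arithmetic only; helper names are hypothetical and the
# accuracy values are copied from the abstract.

def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def id_ood_gap(scores: dict) -> float:
    """Return OOD accuracy minus ID accuracy (negative means the model
    does worse out of distribution than its ID score suggests)."""
    return scores["ood_real"] - scores["id"]


# Accuracies reported in the abstract (ID split vs. BenchVul real-world portion).
reported = {
    "BigVul": {"id": 0.703, "ood_real": 0.493},
    "TitanVul": {"id": 0.590, "ood_real": 0.881},
}

# TitanVul alone vs. TitanVul augmented with RVG, on real-world BenchVul data.
rvg_gain = relative_improvement(0.881, 0.932)
print(f"RVG gain on real-world data: {rvg_gain:.1f}%")

for name, scores in reported.items():
    print(f"{name}: ID={scores['id']:.3f}, OOD={scores['ood_real']:.3f}, "
          f"gap={id_ood_gap(scores):+.3f}")
```

Under these assumptions the dataset with the higher ID score (BigVul) shows a negative gap while TitanVul shows a positive one, which is the abstract's point that ID accuracy is a poor proxy for OOD accuracy.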

