Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Prior work has found that current vulnerability datasets suffer from label inaccuracy rates of 20%-71%, extensive duplication, and poor coverage of critical Common Weakness Enumeration (CWE) types. These issues create a significant generalization gap: models achieve misleadingly high In-Distribution (ID) accuracy (measured on test splits drawn from the same dataset used for training) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three-part solution. First, we introduce BenchVul, a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct TitanVul, a high-quality training dataset of 38,548 functions, by aggregating seven public sources and applying deduplication and validation with a novel multi-agent LLM pipeline. Third, we propose a Realistic Vulnerability Generation (RVG) pipeline, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation reveals that ID performance does not reliably predict Out-of-Distribution (OOD) performance on BenchVul. For example, a model trained on BigVul achieves the highest ID accuracy (0.703) but fails on BenchVul's real-world samples (0.493 OOD accuracy). Conversely, a model trained on TitanVul achieves the highest OOD performance on both the real-world (0.881) and synthesized (0.785) portions of BenchVul, improving on the next-best dataset by 5.3% and 11.8%, respectively, despite a modest ID score (0.590). Augmenting TitanVul with RVG-generated examples further boosts this leading OOD performance, improving accuracy on real-world data by 5.8% (to 0.932).
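The reported gains can be read either as relative improvements or as absolute percentage-point differences, and the abstract does not say which is meant. The short sketch below checks the one case where both endpoints are given (TitanVul at 0.881, TitanVul with RVG at 0.932); the function names are illustrative and not taken from the paper. Under these numbers, the stated 5.8% matches the relative reading.

```python
def relative_gain_pct(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def absolute_gain_points(baseline: float, improved: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (improved - baseline) * 100.0


# Accuracies on the real-world portion of BenchVul, as quoted in the abstract.
titanvul_real = 0.881       # trained on TitanVul only
titanvul_rvg_real = 0.932   # trained on TitanVul augmented with RVG

rel = relative_gain_pct(titanvul_real, titanvul_rvg_real)
pts = absolute_gain_points(titanvul_real, titanvul_rvg_real)

print(f"relative gain: {rel:.1f}%")
print(f"absolute gain: {pts:.1f} percentage points")
```

Whether the 5.3% and 11.8% gains over the next-best dataset follow the same convention is an assumption; the abstract does not give the next-best scores needed to verify it.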