We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. It covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we analyze the challenges and promising research directions of using deep learning to detect software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection because of high false positive rates, low F1 scores, and difficulty detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future of vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.