The impact of software vulnerabilities on everyday software systems is significant. Despite deep learning models being proposed for vulnerability detection, their reliability is questionable. Prior evaluations show high recall/F1 scores of up to 99%, but these models underperform in practical scenarios, particularly when assessed on entire codebases rather than just the fixing commit. This paper introduces Real-Vul, a comprehensive dataset representing real-world scenarios for evaluating vulnerability detection models. Evaluating DeepWukong, LineVul, ReVeal, and IVDetect shows a significant drop in performance, with precision decreasing by up to 95 percentage points and F1 scores by up to 91 points. Furthermore, model performance fluctuates with vulnerability characteristics, yielding better F1 scores for information leaks and code injection than for path resolution and predictable return values. These results highlight a significant performance gap that must be addressed before deep learning-based vulnerability detection can be deployed in practical settings. Overfitting is identified as a key issue, and an augmentation technique is proposed that can improve performance by up to 30%. Contributions include a dataset creation approach for better model evaluation, the Real-Vul dataset itself, and empirical evidence that deep learning models struggle in real-world settings.