Vulnerability detection is crucial to protecting software security. Deep learning (DL) is currently the most promising technique for automating this detection task, owing to its superior ability to extract patterns and representations from large volumes of code. Despite this promise, DL-based vulnerability detection remains in its early stages, with model performance varying considerably across datasets. Drawing insights from other well-explored application areas such as computer vision, we conjecture that the imbalance issue (the number of vulnerable code samples is extremely small) is at the core of this phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns out that these solutions likewise perform differently across datasets and evaluation metrics. Specifically: 1) focal loss is better suited to improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling improves the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new solutions.
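To make the first finding concrete, the sketch below shows the binary focal loss in pure Python. This follows Lin et al.'s standard formulation rather than any implementation from this study; the parameter values (gamma=2.0, alpha=0.25) are the common defaults, not values reported in the paper.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (vulnerable) class.
    y: ground-truth label, 1 = vulnerable, 0 = non-vulnerable.
    gamma: focusing parameter; higher values down-weight easy examples.
    alpha: weight on the positive (minority) class.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes almost nothing to the loss,
# while a misclassified vulnerable sample dominates it -- this is how
# focal loss refocuses training on the rare, hard positive class.
easy = focal_loss(0.95, 1)  # well-classified vulnerable sample
hard = focal_loss(0.10, 1)  # misclassified vulnerable sample
```

With gamma=0 and alpha=0.5 the expression reduces to (half of) the ordinary cross-entropy, which makes the focusing effect of the `(1 - p_t)^gamma` factor easy to see.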