Background: Software Vulnerability (SV) assessment is increasingly adopted to cope with the growing volume and complexity of SVs. Data-driven approaches are widely used to automate SV assessment tasks, particularly the prediction of Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from imbalanced distributions of the CVSS classes, yet this data imbalance has hardly been understood or addressed in the literature. Aims: We conduct a large-scale study to quantify the impact of data imbalance on SV assessment and to mitigate it through data augmentation. Method: We use nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient (MCC). We also find that simple text augmentation, such as combining random text insertion, deletion, and replacement, can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
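To make the idea of balancing CVSS classes through simple text augmentation concrete, the following is a minimal sketch and not the study's actual implementation. It assumes SV descriptions are plain strings with one class label each, and it oversamples every minority class up to the size of the largest class by applying random token deletion, replacement, and insertion. The function names `augment` and `balance`, the probability `p`, and the choice to draw replacement and inserted tokens from the same description are all illustrative assumptions.

```python
import random
from collections import Counter


def augment(text, rng, p=0.1):
    """Return a perturbed copy of `text` using random token edits.

    Each token is deleted with probability p or replaced with probability p;
    after every kept or replaced token, a random token is inserted with
    probability p. Replacement and inserted tokens are drawn from the same
    description (an assumption made for this sketch).
    """
    tokens = text.split()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            continue  # random deletion
        if r < 2 * p:
            out.append(rng.choice(tokens))  # random replacement
        else:
            out.append(tok)  # keep the token unchanged
        if rng.random() < p:
            out.append(rng.choice(tokens))  # random insertion
    # Fall back to the original text if every token was deleted.
    return " ".join(out) if out else text


def balance(texts, labels, seed=0):
    """Oversample minority classes with augmented copies until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Group the original descriptions by class label.
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)

    new_texts, new_labels = list(texts), list(labels)
    for y, items in by_class.items():
        for _ in range(target - counts[y]):
            new_texts.append(augment(rng.choice(items), rng))
            new_labels.append(y)
    return new_texts, new_labels


# Hypothetical toy data: three HIGH-severity descriptions and one LOW.
texts = [
    "buffer overflow in parser allows remote code execution",
    "sql injection in login form allows data disclosure",
    "use after free in renderer allows arbitrary code execution",
    "verbose error message reveals internal path",
]
labels = ["HIGH", "HIGH", "HIGH", "LOW"]
balanced_texts, balanced_labels = balance(texts, labels)
```

In a setup like this, only the training split would be balanced, so that evaluation still reflects the original class distribution. Whether the study follows that protocol is not stated in the abstract.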