Background: Software Vulnerability (SV) prediction needs large, high-quality datasets to perform well. Most current SV datasets rely on expensive expert labeling (human-labeled) and are therefore limited in size. Meanwhile, there are growing efforts to label SVs automatically at scale. However, the suitability of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of D2A, the state-of-the-art auto-labeled SV dataset, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects and use them to investigate the auto-labeled counterparts. Results: We find that more than 50% of the auto-labeled SVs are noisy (incorrectly labeled) and that they hardly overlap with publicly reported SVs. Yet SV prediction models that utilize the noisy auto-labeled SVs can perform up to 22% and 90% better in Matthews Correlation Coefficient and Recall, respectively, than the original models. We also reveal the promise and difficulties of applying noise-reduction methods to automatically address the noise in auto-labeled SV data and thereby maximize data utilization for SV prediction. Conclusions: Our study sheds light on the benefits and challenges of using auto-labeled SVs, paving the way for large-scale SV prediction.