Collecting relevant and high-quality data is integral to the development of effective Software Vulnerability (SV) prediction models. Most current SV datasets rely on SV-fixing commits to extract vulnerable functions and lines. However, none of these datasets has considered latent SVs that exist between the introduction and the fix of the collected SVs. Little is also known about the usefulness of these latent SVs for SV prediction. To bridge these gaps, we conduct a large-scale study of the latent vulnerable functions in two commonly used SV datasets and of their use for function-level and line-level SV prediction. Leveraging the state-of-the-art SZZ algorithm, we identify more than 100k latent vulnerable functions in the studied datasets. We find that these latent functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions, yet they have a noise level of around 6%. Despite the noise, we show that the state-of-the-art SV prediction model can benefit significantly from such latent SVs. The improvements are up to 24.5% in the performance (F1-Score) of function-level SV prediction and up to 67% in the effectiveness of localizing vulnerable lines. Overall, our study presents the first promising step toward using latent SVs to improve the quality of SV datasets and enhance the performance of SV prediction tasks.