With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Because they fail to capture the variety of real-world vulnerability data and to preserve vulnerability semantics, existing augmentation approaches contribute little to model training and can even be counterproductive. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given a vulnerability dataset, our method performs natural, semantics-preserving program transformations to generate a large volume of new samples with enriched diversity. By incorporating our augmented dataset when fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), we achieve up to a 10.1% increase in accuracy and a 23.6% increase in F1 on the vulnerability detection task. Comparison results also show that our proposed method substantially outperforms other prominent vulnerability augmentation approaches.
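To make the idea of a semantics-preserving transformation concrete, the following C sketch is a minimal, hypothetical illustration (not taken from the paper): an original vulnerable function and an augmented variant in which identifiers are renamed and the loop is restructured, so the surface form changes while the vulnerability and program behavior remain identical.

/* Hypothetical illustration of a semantics-preserving transformation
   for vulnerability-data augmentation. Both functions contain the same
   out-of-bounds write (stack buffer overflow); the augmented variant
   renames identifiers and converts the for-loop to a while-loop,
   changing syntax but not semantics, so the label "vulnerable" still
   applies to the generated sample. */

/* Original sample: copies src into a fixed buffer with no bounds check. */
void copy_input(const char *src) {
    char buf[16];
    for (int i = 0; src[i] != '\0'; i++) {
        buf[i] = src[i];              /* overflow when strlen(src) >= 16 */
    }
}

/* Augmented sample: same vulnerability, different surface form. */
void handle_request(const char *payload) {
    char local_store[16];
    int idx = 0;
    while (payload[idx] != '\0') {
        local_store[idx] = payload[idx];   /* identical overflow */
        idx++;
    }
}

A transformation of this kind enlarges the training set without altering ground-truth labels, which is what allows the augmented samples to add diversity rather than label noise.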