With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Because they fail to capture the variety of real-world vulnerability data and to preserve vulnerability semantics, existing augmentation approaches contribute little to model training and can even be counterproductive. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given a vulnerability dataset, our method performs natural, semantics-preserving program transformations to generate a large volume of new samples with enriched diversity. By incorporating our augmented dataset when fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), we achieve up to a 10.1% increase in accuracy and a 23.6% increase in F1 on the vulnerability detection task. Comparison results also show that our proposed method substantially outperforms other prominent vulnerability augmentation approaches.
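To make the idea of a semantics-preserving transformation concrete, the following C sketch is a minimal, hypothetical illustration (not taken from the paper): an original vulnerable function and an augmented variant in which identifiers are renamed and the loop is restructured, so the surface form changes while the vulnerability and program behavior remain identical.

/* Hypothetical illustration of a semantics-preserving transformation
   for vulnerability-data augmentation. Both functions contain the same
   out-of-bounds write (stack buffer overflow); the augmented variant
   renames identifiers and converts the for-loop to a while-loop,
   changing syntax but not semantics, so the label "vulnerable" still
   applies to the generated sample. */

/* Original sample: copies src into a fixed buffer with no bounds check. */
void copy_input(const char *src) {
    char buf[16];
    for (int i = 0; src[i] != '\0'; i++) {
        buf[i] = src[i];              /* overflow when strlen(src) >= 16 */
    }
}

/* Augmented sample: same vulnerability, different surface form. */
void handle_request(const char *payload) {
    char local_store[16];
    int idx = 0;
    while (payload[idx] != '\0') {
        local_store[idx] = payload[idx];   /* identical overflow */
        idx++;
    }
}

A transformation of this kind enlarges the training set without altering ground-truth labels, which is what allows the augmented samples to add diversity rather than label noise.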