With the advancement of deep learning (DL) in various fields, many attempts have been made to uncover software vulnerabilities using data-driven approaches. Nonetheless, existing works lack an effective representation that retains the non-sequential semantic characteristics and contextual relationships of source code attributes. Hence, in this work, we propose XGV-BERT, a framework that combines the pre-trained CodeBERT model with a Graph Convolutional Network (GCN) to detect software vulnerabilities. By jointly training the CodeBERT and GCN modules within XGV-BERT, the proposed model leverages the advantages of large-scale pre-training on vast raw data and of transfer learning, while learning representations of the training data through graph convolution. The results demonstrate that XGV-BERT significantly improves vulnerability detection accuracy compared to two existing methods, VulDeePecker and SySeVR. On the VulDeePecker dataset, XGV-BERT achieves an F1-score of 97.5%, significantly outperforming VulDeePecker, which achieves 78.3%. Likewise, on the SySeVR dataset, XGV-BERT achieves an F1-score of 95.5%, surpassing SySeVR's 83.5%.
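To make the graph-convolution step named in the abstract concrete, the following is a minimal sketch of a single standard GCN layer, H' = ReLU(Â H W), where Â is the adjacency matrix with self-loops, symmetrically normalized. It is an illustration under stated assumptions, not the XGV-BERT implementation: the abstract does not specify how the graph is built or how CodeBERT outputs enter it, so the toy graph, the `node_features` values (standing in for CodeBERT embeddings), and the `weight` matrix are all hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix A."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during propagation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, features, weight):
    """One graph-convolution layer: ReLU(a_norm @ features @ weight)."""
    z = matmul(matmul(a_norm, features), weight)
    return [[max(v, 0.0) for v in row] for row in z]


# Hypothetical 3-node path graph (0 - 1 - 2); values are toy assumptions.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Stand-ins for per-node embeddings (e.g. produced by an encoder such as CodeBERT).
node_features = [[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]]
# Hypothetical weights; in practice these would be learned.
weight = [[1.0, -1.0],
          [0.5, 2.0]]

a_norm = normalize_adjacency(adj)
hidden = gcn_layer(a_norm, node_features, weight)
for row in hidden:
    print(row)
```

Each output row mixes a node's own features with those of its neighbours, which is how a graph layer can capture non-sequential relationships that a purely sequential model would miss. In a jointly trained setup, the feature matrix would come from the encoder and gradients would flow through both modules.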