Developing automated, intelligent software vulnerability detection models has received great attention from both the research and development communities. One of the biggest challenges in this area is the lack of code samples for all programming languages. In this study, we address this issue by proposing a transfer learning technique that leverages available datasets to generate a model for detecting common vulnerabilities in different programming languages. We use C source code samples to train a Convolutional Neural Network (CNN) model; then we use Java source code samples to adapt and evaluate the learned model. We use code samples from two benchmark datasets: the NIST Software Assurance Reference Dataset (SARD) and the Draper VDISC dataset. The results show that the proposed model detects vulnerabilities in both C and Java code with an average recall of 72%. Additionally, we employ explainable AI to investigate how much each feature contributes to the knowledge transfer mechanisms between C and Java in the proposed model.