A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Software vulnerability detection is critical in software security because it identifies potential bugs in software systems, enabling remediation and mitigation measures to be implemented before those bugs can be exploited. Automatic vulnerability identification is important because it can evaluate large codebases more efficiently than manual code auditing. Many Machine Learning (ML) and Deep Learning (DL) based models for detecting vulnerabilities in source code have been presented in recent years. However, a survey that summarises, classifies, and analyses the application of ML/DL models for vulnerability detection is missing. Without a comprehensive survey, it may be difficult to discover gaps in existing research and opportunities for future improvement. Essential areas of research could then be overlooked or under-represented, leading to a skewed understanding of the state of the art in vulnerability detection. This work addresses that gap by presenting a systematic survey that characterises ML/DL-based, source-code-level software vulnerability detection approaches via five primary research questions (RQs). Specifically, RQ1 examines the trend of publications that leverage ML/DL for vulnerability detection, including the evolution of research and the distribution of publication venues. RQ2 describes the vulnerability datasets used by existing ML/DL-based models, including their sources, types, and representations, and analyses the embedding techniques used by these approaches. RQ3 explores the model architectures and design assumptions of ML/DL-based vulnerability detection approaches. RQ4 summarises the types and frequency of vulnerabilities covered by existing studies. Lastly, RQ5 presents a list of open challenges and outlines a potential research roadmap that highlights crucial opportunities for future work.
