Machine learning (ML) techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. How source code is represented is a key component of any such technique, because it affects how well the model can learn the features of the code. As the number of these techniques grows, it is valuable to survey the current state of the field to understand what exists and what is still missing. This paper presents a study of existing ML-based approaches and shows which representations have been used for different cybersecurity tasks and programming languages. We also study which types of models are used with each representation. We find that graph-based representations are the most popular category of representation, and that tokenizers and Abstract Syntax Trees (ASTs) are the two most popular individual representations. The most popular cybersecurity task is vulnerability detection, and the language covered by the most techniques is C. Finally, sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular individual model.