NetFlow is a popular network log format used by many network analysts and researchers. Compared with deep packet inspection, NetFlow data is easier to collect and process, and it is less privacy-intrusive. Many works have used machine learning to detect network attacks from NetFlow data. The first step in these machine learning pipelines is to pre-process the data before it is passed to the learning algorithm. Many approaches exist for pre-processing NetFlow data; however, they simply apply existing general-purpose methods and do not consider the specific properties of network data. We argue that for data originating from software systems, such as NetFlow or software logs, similarities in the frequency and context of feature values are more important than similarities in the values themselves. In this work, we propose an encoding algorithm that directly takes the frequency and the context of feature values into account when the data is processed. Different types of network behaviour can be clustered using this encoding, which aids the detection of anomalies within the network. We train several machine learning models for anomaly detection on data encoded with our algorithm. We evaluate the effectiveness of our encoding on a new dataset that we created for network attacks on Kubernetes clusters and on two well-known public NetFlow datasets. We empirically demonstrate that the machine learning models benefit from our encoding for anomaly detection.