NetFlow is a network log format widely used by network analysts and researchers. Compared with deep packet inspection, NetFlow data is easier to collect and process, and it is less privacy-intrusive. Many works have applied machine learning to NetFlow data to detect network attacks. The first step in such a machine learning pipeline is to pre-process the data before it is passed to the learning algorithm. Many approaches exist for pre-processing NetFlow data; however, they simply apply existing generic methods without considering the specific properties of network data. We argue that for data originating from software systems, such as NetFlow or software logs, similarities in the frequency and context of feature values are more important than similarities in the values themselves. In this work, we propose an encoding algorithm that directly takes the frequency and context of feature values into account when the data is processed. This encoding allows different types of network behaviour to be clustered, which aids the detection of anomalies within the network. We train several machine learning models for anomaly detection on data encoded with our algorithm. We evaluate the effectiveness of the encoding on a new dataset that we created for network attacks on Kubernetes clusters and on two well-known public NetFlow datasets. We empirically demonstrate that the machine learning models benefit from using our encoding for anomaly detection.
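The abstract does not spell out the encoding algorithm, so the following is only a minimal sketch of the underlying idea and not the authors' method. It assumes that the "context" of a feature value can be approximated by the values that immediately precede and follow it in a sequence, and it compares values by the cosine similarity of those neighbour counts. The function names `context_profiles` and `cosine`, and the toy port sequence, are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict
import math


def context_profiles(values):
    """Return (frequencies, profiles) for a sequence of feature values.

    profiles[v] counts which values appear directly before and after v.
    This neighbour-based notion of context is an assumption of this sketch.
    """
    freq = Counter(values)
    profiles = defaultdict(Counter)
    for i, v in enumerate(values):
        if i > 0:
            profiles[v][("prev", values[i - 1])] += 1
        if i + 1 < len(values):
            profiles[v][("next", values[i + 1])] += 1
    return freq, profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sequence of destination ports, for illustration only.
ports = [53, 80, 443, 53, 8080, 443, 53, 80, 443, 53, 8080, 443, 22, 81, 22]
freq, profiles = context_profiles(ports)

# 80 and 8080 are numerically far apart but occur equally often in the same context.
sim_80_8080 = cosine(profiles[80], profiles[8080])
# 80 and 81 are numerically adjacent but share no context.
sim_80_81 = cosine(profiles[80], profiles[81])
print(freq[80], freq[8080], sim_80_8080, sim_80_81)
```

In this toy data, ports 80 and 8080 end up with matching frequencies and context profiles, while ports 80 and 81 share no context despite being numerically adjacent. That is the kind of distinction a purely numeric encoding would miss.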