Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation

Iakovos-Christos Zarkadis, Christos Douligeris

Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Today, at a pivotal time for artificial intelligence (AI), attacks are growing more sophisticated and draw on advanced techniques such as generative artificial intelligence (GenAI) and reinforcement learning, which makes such detection vital for protecting the personal data scattered across the web. In this paper, we address two tasks on the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information, and temporal contextual features from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15, and CIC-DDoS-2019 datasets in a shared feature space. In the first task, we use machine learning (ML) algorithms with stratified cross-validation to classify network attacks with stability and reliability. In the second task, we use adversarial learning algorithms to generate synthetic data, compare it with the real data, and evaluate its fidelity, utility, and privacy using the Synthetic Data Vault (SDV) framework, f-divergences, distinguishability, and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, obtained by combining the SDV framework and the train-on-real, test-on-synthetic (TRTS) and train-on-synthetic, test-on-real (TSTR) tests with non-parametric statistical tests and f-divergence measures.
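The two tasks can be illustrated with a minimal sketch. It is not the authors' pipeline, and everything in it is an assumption made for illustration: the data are toy Gaussian "flows" rather than the unified dataset, the random forest is an arbitrary choice of classifier, and the hypothetical `sample_synthetic` helper fits a per-class Gaussian as a simple stand-in for an adversarial generator. The sketch shows stratified cross-validation, the TSTR and TRTS scores, and two per-feature fidelity checks (a two-sample Kolmogorov-Smirnov test and the Jensen-Shannon divergence, which is one example of an f-divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)


def make_flows(n_per_class, rng):
    """Toy stand-in for flow features (assumption): 0 = benign, 1 = attack."""
    benign = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 4))
    attack = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 4))
    return np.vstack([benign, attack]), np.repeat([0, 1], n_per_class)


def sample_synthetic(X, y, n_per_class, rng):
    """Hypothetical generator: a per-class Gaussian, NOT an adversarial model."""
    parts, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        mean, cov = Xc.mean(axis=0), np.cov(Xc, rowvar=False)
        parts.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        labels.append(np.full(n_per_class, c))
    return np.vstack(parts), np.concatenate(labels)


def fit_and_score(X_fit, y_fit, X_eval, y_eval):
    """Train on one data set and report macro-F1 on another."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
    return f1_score(y_eval, clf.predict(X_eval), average="macro")


X, y = make_flows(500, rng)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Task 1: stratified k-fold keeps the class ratio identical in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_train, y_train, cv=cv, scoring="f1_macro",
)

# Task 2: generate synthetic rows from the real training split only.
X_syn, y_syn = sample_synthetic(X_train, y_train, 350, rng)
tstr = fit_and_score(X_syn, y_syn, X_test, y_test)    # train synthetic, test real
trts = fit_and_score(X_train, y_train, X_syn, y_syn)  # train real, test synthetic

# Fidelity of one feature: non-parametric KS test and JS divergence on
# histograms that share the same bin edges.
ks = ks_2samp(X_train[:, 0], X_syn[:, 0])
edges = np.histogram_bin_edges(np.concatenate([X_train[:, 0], X_syn[:, 0]]), bins=20)
p, _ = np.histogram(X_train[:, 0], bins=edges)
q, _ = np.histogram(X_syn[:, 0], bins=edges)
js_div = jensenshannon(p, q, base=2) ** 2  # SciPy returns the distance (sqrt)

print(f"CV macro-F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"TSTR: {tstr:.3f}  TRTS: {trts:.3f}")
print(f"KS statistic: {ks.statistic:.3f}  p-value: {ks.pvalue:.3f}  JS: {js_div:.4f}")
```

In this kind of setup, a small spread of the fold scores suggests a stable classifier, a TSTR score close to the real-data score suggests useful synthetic data, and a low divergence together with a KS test that does not reject equality suggests good per-feature fidelity. A privacy assessment would need further checks that this sketch does not cover.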
