Cybersecurity remains a critical challenge in the digital age, and anomaly detection on network traffic flows is a key instrument in the fight against cyber threats. In this study, we address the prevalent problem of data integrity in network traffic datasets, which are instrumental in developing machine learning (ML) models for anomaly detection. We introduce two refined versions of the CICIDS-2017 dataset, NFS-2023-nTE and NFS-2023-TE, processed with NFStream to ensure methodologically sound flow expiration and labeling. We compare the performance of the Random Forest (RF) algorithm on the original CICIDS-2017, its refined counterparts WTMC-2021 and CRiSIS-2022, and our NFStream-generated datasets, in both binary and multi-class classification settings. We observe that the RF model is exceptionally robust, achieving consistently high performance metrics irrespective of the underlying dataset quality, which prompts a critical discussion of the actual impact of data integrity on ML efficacy. Our study underscores the importance of continual refinement and methodological rigor in dataset generation for network security research. As the landscape of network threats evolves, so must the tools and techniques used to detect and analyze them.
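The flow expiration step mentioned in the abstract can be illustrated with a minimal, standard-library sketch. It is not NFStream's implementation and not the procedure used to build NFS-2023-nTE or NFS-2023-TE. The `Flow` class, the `split_into_flows` helper, and both timeout values are hypothetical, chosen only to show how idle and active timeouts, a common convention in flow exporters, split one packet stream into several flow records.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    key: tuple          # e.g. (src_ip, dst_ip, src_port, dst_port, protocol)
    first_seen: float   # timestamp of the first packet in the flow
    last_seen: float    # timestamp of the most recent packet in the flow
    packets: int = 1


def split_into_flows(packets, idle_timeout=60.0, active_timeout=120.0):
    """Group (timestamp, key) packets into flows.

    A flow is closed when the gap since its last packet exceeds
    idle_timeout, or when its total age exceeds active_timeout.
    Packets must be sorted by timestamp. The timeout defaults are
    illustrative assumptions, not values taken from the study.
    """
    active = {}    # key -> currently open Flow
    finished = []  # expired flows, in order of expiry
    for ts, key in packets:
        flow = active.get(key)
        if flow is not None:
            idle_expired = ts - flow.last_seen > idle_timeout
            active_expired = ts - flow.first_seen > active_timeout
            if idle_expired or active_expired:
                finished.append(flow)
                flow = None
        if flow is None:
            # Start a new flow record for this key.
            active[key] = Flow(key, ts, ts)
        else:
            flow.last_seen = ts
            flow.packets += 1
    # Flows still open at the end of the capture are flushed as-is.
    finished.extend(active.values())
    return finished


# Hypothetical 5-tuple and timestamps, for illustration only.
key = ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP")
packets = [(0.0, key), (10.0, key), (100.0, key), (130.0, key)]
flows = split_into_flows(packets)
```

Here the 90-second gap between the second and third packets exceeds the assumed idle timeout, so the same 5-tuple yields two flow records. Because per-flow labels are attached to these records, a different expiration policy changes both the feature values and the number of labeled samples a classifier sees.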