Evaluating ML-Based Anomaly Detection Across Datasets of Varied Integrity: A Case Study

Cybersecurity remains a critical challenge in the digital age, and anomaly detection on network traffic flows is a key instrument in the fight against cyber threats. In this study, we address the prevalent issue of data integrity in the network traffic datasets used to develop machine learning (ML) models for anomaly detection. We introduce two refined versions of the CICIDS-2017 dataset, NFS-2023-nTE and NFS-2023-TE, processed using NFStream to ensure methodologically sound flow expiration and labeling. We compare the performance of the Random Forest (RF) algorithm across the original CICIDS-2017, its refined counterparts WTMC-2021 and CRiSIS-2022, and our NFStream-generated datasets, in both binary and multi-class classification settings. We observe that the RF model is exceptionally robust, achieving consistently high performance metrics irrespective of the underlying dataset quality, which prompts a critical discussion of the actual impact of data integrity on ML efficacy. Our study underscores the importance of continual refinement and methodological rigor in dataset generation for network security research. As the landscape of network threats evolves, so must the tools and techniques used to detect and analyze them.
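The abstract does not spell out what "flow expiration" involves, so the following is only a rough illustration of the general idea, not the authors' pipeline and not NFStream's implementation. Flow meters commonly close a flow record when it has been idle for too long or has been active for too long. The sketch below applies those two rules to a time-ordered packet list. The names `Packet` and `split_flows`, the string flow keys, and the timeout values of 120 s and 1800 s are assumptions chosen for illustration.

```python
from collections import namedtuple

# Hypothetical minimal packet: a timestamp in seconds and a flow key
# (in practice the key would be a 5-tuple of addresses, ports, and protocol).
Packet = namedtuple("Packet", "ts key")


def split_flows(packets, idle_timeout=120.0, active_timeout=1800.0):
    """Group time-ordered packets into flow records using two expiration rules.

    A record is closed when a new packet for the same key arrives after
    more than `idle_timeout` seconds of silence, or more than
    `active_timeout` seconds after the record's first packet.
    Simplification: expiration is only checked on packet arrival, whereas
    a real flow meter would also scan for idle flows periodically.
    """
    active = {}   # key -> [first_ts, last_ts, packet_count]
    records = []  # closed records: (key, first_ts, last_ts, packet_count)

    for pkt in packets:
        rec = active.get(pkt.key)
        if rec is not None:
            idle_expired = pkt.ts - rec[1] > idle_timeout
            active_expired = pkt.ts - rec[0] > active_timeout
            if idle_expired or active_expired:
                records.append((pkt.key, *rec))
                rec = None
        if rec is None:
            # Start a new record for this key.
            active[pkt.key] = [pkt.ts, pkt.ts, 1]
        else:
            # Extend the existing record.
            rec[1] = pkt.ts
            rec[2] += 1

    # Flush records that are still open at the end of the capture.
    for key, rec in active.items():
        records.append((key, *rec))
    return records


if __name__ == "__main__":
    demo = [Packet(0, "A"), Packet(10, "A"), Packet(200, "A"), Packet(205, "B")]
    for record in split_flows(demo):
        print(record)
```

In the demo, the third packet of flow "A" arrives 190 s after the second, which exceeds the assumed idle timeout, so "A" is split into two records. Inconsistent splitting of this kind is one way the same capture can yield different flow-level datasets, which is why the expiration policy matters when datasets are regenerated.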


