The importance of the clustering model to detect new types of intrusion in data traffic

In the current digital age, the volume of data generated by cyber activity has become enormous and continues to grow. This data may contain valuable insights that can be harnessed to improve cybersecurity measures. However, much of it is unclassified (unlabeled) and qualitative, which poses significant challenges for traditional analysis methods. Clustering helps to identify hidden patterns and structures in data by grouping similar data points, which makes threats simpler to identify and address. Clustering can be defined as a data mining (DM) approach that uses similarity calculations to divide a data set into several categories. Typical families are hierarchical, density-based, and partitioning clustering algorithms. This work uses the K-means algorithm, a popular clustering technique. Using K-means, we worked with two different types of data. First, we worked with data that we gathered ourselves, applying the XGBoost algorithm after completing the aggregation (clustering) with K-means. The data was gathered using a Kali Linux environment, the CICFlowMeter traffic tool, and the PuTTY software, by launching diverse, simple attacks. Because cyber threats are dynamic, new attack types often emerge for which labeled data might not yet exist; this approach could help to identify new attack types that are distinct from known attacks and to label them based on the characteristics they exhibit. The model counted the attacks and assigned a number to each of them. Second, we repeated the same work on a ready-made dataset from the Kaggle repository called "Intrusion Detection in Internet of Things Network"; the clustering model worked well and detected the number of attacks correctly, as shown in the results section.
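
As an illustration of the clustering step described above, the following is a minimal sketch of K-means grouping unlabeled traffic flows and assigning a number to each group. It is not the pipeline used in this work: the two flow features, their values, the choice of k = 3, and the deterministic farthest-point seeding are all assumptions made for the example, and the XGBoost stage is omitted.

```python
import math
from collections import Counter


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means. Returns (centroids, labels)."""
    # Farthest-point seeding: an assumption made here for reproducibility,
    # not something stated in the source.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(iters):
        # Assignment step: each flow joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Hypothetical, already-scaled flow features: (mean packet size, flow duration).
flows = [
    (0.10, 0.20), (0.12, 0.18), (0.09, 0.22),  # first traffic pattern
    (0.80, 0.85), (0.82, 0.80), (0.78, 0.88),  # second traffic pattern
    (0.15, 0.90), (0.12, 0.92), (0.18, 0.88),  # third traffic pattern
]

centroids, labels = kmeans(flows, k=3)
counts = Counter(labels)  # number of flows assigned to each numbered cluster
print("cluster number per flow:", labels)
print("flows per cluster:", dict(counts))
```

In this sketch the cluster numbers play the role of the numbers assigned to each attack type; on real traffic, k would have to be chosen or estimated rather than fixed by hand, and the features would come from a flow extractor such as CICFlowMeter.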
