Malware attacks have become significantly more frequent and sophisticated in recent years, making malware detection and classification critical components of information security. Because of the large number of malware samples available, it is essential to categorize them according to their malicious characteristics. Clustering algorithms are therefore increasingly used in computer security to analyze the behavior of malware variants and to discover new malware families. Online clustering algorithms help analysts understand malware behavior and respond more quickly to new threats. This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families. A clustering decision rule divides the streaming data into samples from known families and samples from new, emerging families. Samples assigned to known families are classified with a weighted k-nearest neighbor classifier, while the online k-means algorithm clusters the remaining samples, achieving cluster purity ranging from 90.20% for four clusters to 93.34% for ten clusters. The work is based on static analysis of portable executable files for the Windows operating system. Experimental results indicate that the proposed online clustering model can create high-purity clusters corresponding to malware families. This allows malware analysts to receive groups of similar malware samples, speeding up their analysis.
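The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering decision rule, so a simple distance threshold to the nearest labeled sample is assumed here, and the names `process_stream`, `threshold`, and `OnlineKMeans` are hypothetical. Feature vectors are assumed to be numeric tuples extracted from static PE analysis, and the online k-means update uses the standard sequential rule with a 1/n step size, which may differ from the paper's variant.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_knn(x, labeled, k=3):
    """Distance-weighted k-NN vote; returns (label, distance to nearest neighbor)."""
    neighbors = sorted((dist(x, v), lab) for v, lab in labeled)[:k]
    votes = Counter()
    for d, lab in neighbors:
        votes[lab] += 1.0 / (d + 1e-9)  # closer neighbors count more
    return votes.most_common(1)[0][0], neighbors[0][0]


class OnlineKMeans:
    """Sequential k-means: each sample moves its nearest centroid by 1/n."""

    def __init__(self, k):
        self.k = k
        self.centroids = []
        self.counts = []

    def partial_fit(self, x):
        if len(self.centroids) < self.k:  # seed centroids with the first k samples
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1
        j = min(range(self.k), key=lambda i: dist(x, self.centroids[i]))
        self.counts[j] += 1
        eta = 1.0 / self.counts[j]
        self.centroids[j] = [c + eta * (xi - c) for c, xi in zip(self.centroids[j], x)]
        return j


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(true_labels)


def process_stream(stream, labeled, n_new_clusters, threshold):
    """Route each sample to a known family or to an online cluster."""
    km = OnlineKMeans(n_new_clusters)
    out = []
    for x in stream:
        label, nearest = weighted_knn(x, labeled)
        if nearest <= threshold:  # assumed decision rule, not taken from the paper
            out.append(("known", label))
        else:
            out.append(("new", km.partial_fit(x)))
    return out
```

In this sketch, purity is the share of samples whose true family matches the majority family of their cluster, which is the usual definition behind figures such as the 90.20% and 93.34% reported above. Seeding centroids with the first k unknown samples is a simplification chosen for brevity.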