New malware is generated constantly and in large volumes, and it must be not only distinguished from benign samples but also classified into malware families. This requires both tracking how existing malware families evolve and examining emerging ones. This paper focuses on the online processing of incoming malicious samples: each sample is either assigned to an existing family or, if it comes from a new family, clustered. We experimented with seven prevalent malware families from the EMBER dataset, four of which appear in the training set and three of which appear only in the test set as new families. Based on the classification score of a multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of the streaming data with a balanced accuracy of 95.33%. We then clustered the remaining data using a self-organizing map, achieving a purity ranging from 47.61% with four clusters to 77.68% with ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
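The routing step and the purity measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `route_by_confidence` and `purity`, the threshold value of 0.9, and the toy scores are all assumptions made for illustration. The abstract does not state the decision rule applied to the classification score, so a simple cutoff on the highest class score is assumed here, and purity is computed with its usual definition (the fraction of samples that belong to the majority family of their cluster). The self-organizing map itself is not implemented; cluster assignments are taken as given.

```python
from collections import Counter


def route_by_confidence(scores, threshold=0.9):
    """Return (predicted_class, accepted) for each row of class scores.

    A sample is accepted for classification when its highest class score
    reaches the threshold; otherwise it is deferred to clustering.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    routed = []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        routed.append((best, row[best] >= threshold))
    return routed


def purity(cluster_ids, true_labels):
    """Fraction of samples that match the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(true_labels)


# Toy classifier outputs over four known families (illustrative values only).
scores = [
    [0.97, 0.01, 0.01, 0.01],  # confident: classified into family 0
    [0.40, 0.35, 0.15, 0.10],  # uncertain: deferred to clustering
]
routed = route_by_confidence(scores, threshold=0.9)

# Toy cluster assignments for deferred samples and their true families.
score = purity([0, 0, 1, 1, 1], ["A", "A", "A", "B", "B"])
```

In this toy run the first sample is accepted and the second is deferred, and the purity is 4/5, since each of the two clusters contributes two majority-family samples. Purity tends to increase with the number of clusters because smaller clusters are more likely to be homogeneous, which is consistent with the rise from four to ten clusters reported above.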