New malware is generated constantly and in large volumes; it must not only be distinguished from benign samples but also classified into malware families. To this end, it is necessary to investigate how existing malware families evolve and to examine emerging families. This paper focuses on the online processing of incoming malicious samples, assigning them to existing families or, for samples from new families, clustering them. We experimented with seven prevalent malware families from the EMBER dataset, with four in the training set and three additional new families introduced in the test set. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of the streaming data with a balanced accuracy of 95.33%. We then clustered the remaining data using a self-organizing map, achieving a purity ranging from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
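The two quantities the abstract relies on, the score-based routing of samples and cluster purity, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `route_by_confidence` and `purity`, the rule "keep a sample if its top class score reaches a cutoff", and the cutoff value 0.9 are all assumptions made for illustration, since the abstract does not state the exact decision rule or threshold. The self-organizing map itself is not reproduced here; `purity` only evaluates cluster assignments that some clustering step has already produced.

```python
from collections import Counter


def route_by_confidence(scores, threshold=0.9):
    """Split samples by the classifier's top class score.

    scores: list of per-class score lists (e.g. MLP softmax outputs).
    threshold: hypothetical cutoff; the abstract does not give the value used.
    Returns (classified, to_cluster), where classified holds
    (sample_index, predicted_class) pairs and to_cluster holds the indices
    of samples deferred to the clustering stage.
    """
    classified, to_cluster = [], []
    for i, probs in enumerate(scores):
        # Index of the highest-scoring class for this sample.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            classified.append((i, best))
        else:
            to_cluster.append(i)
    return classified, to_cluster


def purity(cluster_ids, true_labels):
    """Fraction of samples carrying the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    # Sum the size of the largest label group in each cluster.
    majority_total = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority_total / len(true_labels)


# Toy scores for three samples over four known families (made-up numbers).
scores = [
    [0.97, 0.01, 0.01, 0.01],  # confident: assigned to family 0
    [0.40, 0.35, 0.15, 0.10],  # uncertain: deferred to clustering
    [0.02, 0.95, 0.02, 0.01],  # confident: assigned to family 1
]
classified, to_cluster = route_by_confidence(scores, threshold=0.9)

# Toy cluster assignments with hypothetical family labels "a" and "b".
toy_purity = purity([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"])
```

In the toy purity call, cluster 0 is entirely "a" and cluster 1 has two "b" samples out of three, so four of the five samples match their cluster's majority label. Purity tends to rise as the number of clusters grows, because smaller clusters are more easily dominated by a single family; this is consistent with the higher value reported for ten clusters than for four.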