Identifying the family to which a malware specimen belongs is essential for understanding its behavior and developing mitigation strategies. Solutions proposed in prior work, however, are often impractical because their evaluations omit realistic factors: learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families, while obtaining a large quantity of up-to-date labeled malware for training can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can abstain from making a prediction (a reject option), which yields promising results in identifying novel malware families and helps maintain model performance when only a small quantity of labeled data is available. Using static analysis, we perform bulk classification of nearly 2,900 malware families, both rare and prominent, on nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, our method achieves an F1 score of 0.80, surpassing both supervised and semi-supervised baseline models.
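To make the two core ideas in the abstract concrete (clustering by non-negative matrix factorization, and abstaining when the labeled evidence is weak), the following is a minimal sketch and not the HNMFk Classifier itself. It assumes a fixed number of clusters `k`, whereas the method described above estimates that number automatically and applies the factorization hierarchically. The function names `nmf` and `propagate_labels`, the `min_purity` threshold, and the `ABSTAIN` marker are all hypothetical and introduced only for illustration.

```python
import numpy as np

ABSTAIN = -1  # hypothetical marker: "no known label" on input, "abstain" on output


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (features x samples) as W @ H using
    Lee-Seung multiplicative updates for the Frobenius-norm objective.
    Assumption: k is given; no automatic model selection is done here."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def propagate_labels(H, known, min_purity=0.8):
    """Assign each sample to the cluster with the largest weight in H.
    Unlabeled samples in a cluster receive the majority known label only if
    that label covers at least min_purity of the cluster's labeled samples.
    Otherwise they keep ABSTAIN, which may indicate a new family."""
    known = np.asarray(known)
    clusters = np.argmax(H, axis=0)
    pred = known.copy()
    for c in np.unique(clusters):
        members = clusters == c
        seen = known[members & (known != ABSTAIN)]
        if seen.size == 0:
            continue  # no labeled sample in this cluster: abstain
        values, counts = np.unique(seen, return_counts=True)
        if counts.max() / seen.size >= min_purity:
            pred[members & (known == ABSTAIN)] = values[np.argmax(counts)]
    return pred


# Toy usage: 4 features x 6 samples with two obvious groups of samples.
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)
W, H = nmf(X, k=2)
# Family 3 is known for two samples; everything else is unlabeled.
print(propagate_labels(H, [3, ABSTAIN, 3, ABSTAIN, ABSTAIN, ABSTAIN]))
```

The purity threshold is one simple way to trade coverage for precision: raising `min_purity` makes the sketch abstain more often, which is the behavior a reject option is meant to provide when few labels are available.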