The recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples into malware families. Compared with labor-intensive reverse engineering, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly analyzed by hand before training. Furthermore, as novel malware samples arise that fall outside the scope of the training set, additional reverse engineering effort is needed to update the training set. The sheer volume of new samples found in the wild puts substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier that uses semi-supervised learning to achieve high accuracy with sparse training data. We present a novel domain-knowledge-aware technique for augmenting malware feature representations, which enhances the few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware feature augmentation methods and highlights the capabilities of similar semi-supervised classifiers in addressing malware classification problems.