This paper proposes a novel method for classifying malware into families that uses high-resolution greyscale images and multiple instance learning to resist adversarial binary enlargement. Current visualisation-based malware classifiers largely rely on lossy input transformations, such as resizing, to handle large images of variable size. Empirical analysis and experiments show that these transformations discard crucial information, and that an attacker can exploit this loss. The proposed solution divides each image into patches and classifies it using embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples, compared with $22.8\%$ for the baseline. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images .
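The patch-based pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The function names, the image width, the patch size, and the weights `V` and `w` are all hypothetical. The `embed_patch` function is a two-number stand-in for the convolutional neural network. The tanh-based attention form is an assumption, since the abstract only states that an attention aggregation function is used.

```python
import math


def bytes_to_image(data, width):
    """Lay raw bytes out row by row as a greyscale image (zero-pad the last row)."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def split_into_patches(image, size):
    """Cut the image into non-overlapping size x size patches (zero-pad the bottom)."""
    width = len(image[0])
    assert width % size == 0, "image width must be a multiple of the patch size"
    rows = image + [[0] * width for _ in range(-len(image) % size)]
    patches = []
    for top in range(0, len(rows), size):
        for left in range(0, width, size):
            patches.append([row[left:left + size] for row in rows[top:top + size]])
    return patches


def embed_patch(patch):
    """Hypothetical stand-in for the CNN: mean and std of intensities in [0, 1]."""
    pixels = [p / 255.0 for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, math.sqrt(var)]


def attention_pool(embeddings, V, w):
    """Aggregate patch embeddings into one bag embedding.

    Assumed form: score_k = w . tanh(V h_k), weights = softmax(scores),
    bag = sum_k weights_k * h_k.
    """
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(v_row, h))) for v_row in V]
        scores.append(sum(w_i * t_i for w_i, t_i in zip(w, hidden)))
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    bag = [sum(a * h[d] for a, h in zip(weights, embeddings)) for d in range(dim)]
    return bag, weights


# Toy "binary": 512 bytes laid out as a 32 x 16 image, cut into 8 x 8 patches.
data = bytes(range(256)) * 2
image = bytes_to_image(data, width=16)
patches = split_into_patches(image, size=8)
embeddings = [embed_patch(p) for p in patches]

# Fixed illustrative attention parameters; in a real model these are learned.
V = [[1.0, 0.0], [0.0, 1.0]]
w = [0.5, -0.5]
bag, weights = attention_pool(embeddings, V, w)
print(len(patches), len(bag), round(sum(weights), 6))
```

In this sketch, appending bytes to the input only adds further patches to the bag and leaves the existing patches at full resolution, whereas resizing the whole image to a fixed shape would compress the original content. A classifier head applied to `bag` would complete the pipeline and is omitted here.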