Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting almost-perfect performance. However, they assemble ground truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influences performance, and how well static and dynamic features complement each other. This work sheds light on those open questions by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both provides only marginal improvement over static features alone. We find no correlation between packing and classification accuracy, and show that missing behaviors in dynamically extracted features heavily penalize their performance. We also demonstrate that a larger number of families makes classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family generalize better to unseen data.
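
To make the notion of a balanced dataset concrete, the following is a minimal sketch of drawing a fixed number of samples from each family so that every family is equally represented (the abstract reports 670 families at 100 samples each, i.e. 67,000 samples). It is not the authors' pipeline: the function name `balance_by_family`, the `(sample_id, family)` input format, and the choice to drop families with too few samples are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_family(samples, per_family=100, seed=0):
    """Return {family: [sample_id, ...]} with exactly `per_family` ids each.

    `samples` is an iterable of (sample_id, family) pairs (hypothetical format).
    Families with fewer than `per_family` samples are dropped (an assumption).
    """
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for family, ids in sorted(by_family.items()):
        if len(ids) >= per_family:
            # Sample without replacement to get a uniform per-family count.
            balanced[family] = rng.sample(ids, per_family)
    return balanced


if __name__ == "__main__":
    # Toy, made-up labels: family "a" has 5 samples, "b" has 3, "c" has 1.
    toy = [(f"a{i}", "a") for i in range(5)]
    toy += [(f"b{i}", "b") for i in range(3)]
    toy += [("c0", "c")]
    subset = balance_by_family(toy, per_family=3)
    print({family: len(ids) for family, ids in subset.items()})
```

With `per_family=100` and a corpus in which at least 670 families meet the threshold, the same procedure would yield a uniform per-family distribution of the kind the abstract describes.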
