Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020

Malware, malicious software designed to damage computer systems and perpetrate scams, is proliferating at an alarming rate, with thousands of new threats emerging daily. Android devices, prevalent as smartphones, smartwatches, tablets, and Internet of Things (IoT) devices, represent a vast attack surface, which makes malware detection crucial. Although advanced analysis techniques exist, Machine Learning (ML) has emerged as a promising tool to automate and accelerate the discovery of these threats. This work tests ML algorithms for detecting malicious code from dynamic execution characteristics. For this purpose, the CICMalDroid 2020 dataset, composed of dynamically obtained Android malware behavior samples, was used with the algorithms XGBoost, Naïve Bayes (NB), Support Vector Classifier (SVC), and Random Forest (RF). The study focused on empirically evaluating the impact of the SMOTE technique, used to mitigate class imbalance in the data, on the performance of these models. The results indicate that, in 75% of the tested configurations, applying SMOTE led to performance degradation or only marginal improvements, with an average loss of 6.14 percentage points. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%. It is inferred that SMOTE, although widely used, did not prove beneficial for Android malware detection on the CICMalDroid 2020 dataset, possibly due to the complexity and sparsity of the dynamic characteristics or the nature of malicious relationships. This work highlights the robustness of tree-ensemble models, such as XGBoost, and suggests that algorithm-level approaches to data balancing may be more effective than generating synthetic instances in certain cybersecurity scenarios.
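To make the two central notions of the abstract concrete, the following is a minimal, standard-library-only sketch of SMOTE-style oversampling (synthetic minority points interpolated between a sample and one of its nearest minority neighbours) and of support-weighted recall. It is an illustration under stated assumptions, not the pipeline used in this study: the function names `smote_oversample` and `weighted_recall`, the two-dimensional toy points, and the class labels are all hypothetical, and a real experiment would more likely rely on an established library implementation.

```python
import random
from collections import Counter


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length numeric tuples with at least two elements.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(support[c] / total * (hits[c] / support[c]) for c in support)


# Toy data (assumed, not taken from CICMalDroid 2020).
benign = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.3)]
malware = [(1.0, 1.0), (1.2, 0.9)]

# Oversample the minority class until both classes have the same size.
new_points = smote_oversample(malware, n_new=len(benign) - len(malware), k=1)
balanced_malware = malware + new_points

score = weighted_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

In a setup like the one described, oversampling would normally be applied to the training split only, so that synthetic points never leak into the evaluation data. Because every synthetic point lies on a segment between two existing minority samples, sparse or irregular feature spaces can yield synthetic samples that do not resemble real behavior, which is one plausible reading of the degradation reported above.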
