Machine learning methods can detect Android malware with very high accuracy. However, these classifiers have an Achilles' heel: concept drift. They rapidly become out of date and ineffective as both malware and benign apps evolve. Our research finds that, after training an Android malware classifier on one year's worth of data, its F1 score dropped from 0.99 to 0.76 after 6 months of deployment on new test samples. In this paper, we propose new methods to combat the concept drift problem of Android malware classifiers. Since a deployed classifier must be updated continuously, we use active learning: we select new samples for analysts to label, and then add the labeled samples to the training set to retrain the classifier. Our key insight is that similarity-based uncertainty is more robust against concept drift. Therefore, we combine contrastive learning with active learning. We propose a new hierarchical contrastive learning scheme and a new sample selection technique to continuously train the Android malware classifier. Our evaluation shows that this leads to significant improvements compared to previously published methods for active learning. Our approach reduces the false negative rate from 14% (for the best baseline) to 9%, while also reducing the false positive rate from 0.86% to 0.48%. In addition, our approach maintains more consistent performance across a seven-year time period than past methods.
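The active learning loop described above can be illustrated with a minimal sketch. It is not the paper's hierarchical contrastive scheme or its sample selector: it assumes embeddings are already available from some encoder, and it substitutes a simple stand-in score (mean Euclidean distance to the k nearest labeled samples) for similarity-based uncertainty. All function names, the `ask_analyst` callback, and the parameters `k` and `budget` are hypothetical.

```python
import numpy as np

def similarity_uncertainty(train_emb, new_emb, k=3):
    """Stand-in score: mean Euclidean distance from each new sample
    to its k nearest labeled samples in embedding space."""
    # Pairwise distances, shape (n_new, n_train).
    dists = np.linalg.norm(new_emb[:, None, :] - train_emb[None, :, :], axis=2)
    k = min(k, train_emb.shape[0])
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

def select_for_labeling(train_emb, new_emb, budget, k=3):
    """Indices of the `budget` new samples least similar to the labeled set."""
    scores = similarity_uncertainty(train_emb, new_emb, k)
    return np.argsort(-scores, kind="stable")[:budget]

def active_learning_round(train_emb, train_y, new_emb, ask_analyst, budget, k=3):
    """Pick samples, obtain analyst labels, and return the enlarged training set.
    Retraining the encoder and classifier on the result is left out of this sketch."""
    picked = select_for_labeling(train_emb, new_emb, budget, k)
    new_labels = np.array([ask_analyst(int(i)) for i in picked])
    return (np.vstack([train_emb, new_emb[picked]]),
            np.concatenate([train_y, new_labels]),
            picked)
```

In use, `active_learning_round` would be called once per deployment period (for example, monthly) with a fixed labeling budget, followed by retraining on the returned training set.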