Machine learning methods can detect Android malware with very high accuracy. However, these classifiers have an Achilles heel, concept drift: they rapidly become out of date and ineffective as both malware and benign apps evolve. We find that, after training an Android malware classifier on one year's worth of data, the F1 score dropped from 0.99 to 0.76 within 6 months of deployment on new test samples. In this paper, we propose new methods to combat concept drift in Android malware classifiers. Because the classifier needs to be deployed continuously, we use active learning: we select new samples for analysts to label, then add the labeled samples to the training set and retrain the classifier. Our key idea is that similarity-based uncertainty is more robust against concept drift. We therefore combine contrastive learning with active learning, proposing a new hierarchical contrastive learning scheme and a new sample selection technique to continuously train the Android malware classifier. Our evaluation shows that this leads to significant improvements over previously published active learning methods: our approach reduces the false negative rate from 16% (for the best baseline) to 10%, while maintaining the same false positive rate (0.6%). Our approach also maintains more consistent performance across a seven-year time period than past methods.
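To make the active learning loop described in the abstract concrete, the following is a minimal sketch of one selection round. It is not the paper's hierarchical contrastive scheme or its actual selection technique: it assumes that samples have already been embedded by some encoder, and it substitutes a deliberately simple stand-in for similarity-based uncertainty, namely the mean Euclidean distance to the k nearest training embeddings. All function names, the `oracle` callback (standing in for the human analyst), and the toy data are hypothetical.

```python
import numpy as np


def similarity_uncertainty(train_emb, new_emb, k=3):
    """Mean Euclidean distance from each new sample to its k nearest training samples.

    Assumption: a large value means the sample resembles nothing seen so far,
    so it is treated as more uncertain (more likely to reflect drift).
    """
    # Pairwise distances, shape (n_new, n_train).
    dists = np.linalg.norm(new_emb[:, None, :] - train_emb[None, :, :], axis=2)
    # Keep the k smallest distances for every new sample.
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


def select_for_labeling(scores, budget):
    """Indices of the `budget` most uncertain samples, most uncertain first."""
    order = np.argsort(-scores, kind="stable")
    return order[:budget]


def active_learning_round(train_emb, train_labels, new_emb, oracle, budget, k=3):
    """One round: score new samples, pick some, ask the analyst, grow the training set.

    Retraining the encoder and classifier on the enlarged set is omitted here.
    """
    scores = similarity_uncertainty(train_emb, new_emb, k=k)
    picked = select_for_labeling(scores, budget)
    new_labels = np.array([oracle(i) for i in picked])
    train_emb = np.vstack([train_emb, new_emb[picked]])
    train_labels = np.concatenate([train_labels, new_labels])
    return train_emb, train_labels, picked


# Toy data: a benign cluster near (0, 0) and a malware cluster near (5, 5).
train_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

# Two new samples look familiar; the third is far from both clusters.
new_emb = np.array([[0.05, 0.05], [5.05, 5.05], [10.0, -10.0]])
analyst_labels = [0, 1, 1]  # hypothetical ground truth the analyst would give

train_emb, train_labels, picked = active_learning_round(
    train_emb, train_labels, new_emb,
    oracle=lambda i: analyst_labels[i], budget=1,
)
# With a labeling budget of 1, the outlying third sample (index 2) is selected.
```

With a budget of one label, the round spends it on the sample that is farthest from everything in the training set, which is the intuition behind preferring similarity-based scores over classifier confidence when the data distribution shifts.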