The rapidly evolving nature of Android apps poses a significant challenge to malware detection systems built on static batch machine learning algorithms, because such models quickly become obsolete. Nevertheless, the existing literature pays little attention to this issue, and many advanced Android malware detection approaches, such as Drebin, DroidDet and MaMaDroid, rely on static models. In this work, we show how retraining techniques can maintain detector capabilities over time. In particular, we analyze the effect of two aspects on the efficiency and performance of the detectors: 1) the frequency with which the models are retrained, and 2) the data used for retraining. In the first experiment, we compare periodic retraining with a more advanced concept drift detection method that triggers retraining only when necessary. In the second experiment, we analyze sampling methods that reduce the amount of data used to retrain models. Specifically, we compare fixed-size windows of recent data with state-of-the-art active learning methods that select the apps that keep the training dataset small but diverse. Our experiments show that concept drift detection and sample selection mechanisms yield very efficient retraining strategies that can be used successfully to maintain the performance of state-of-the-art static Android malware detectors in changing environments.
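To make the idea of drift-triggered retraining on a fixed-size window of recent data concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a simple error-rate drift detector and a toy one-feature nearest-centroid classifier standing in for a real detector such as Drebin. All names (`ErrorRateDriftDetector`, `train_centroid_model`, `run_stream`) and all parameter values are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple

Sample = Tuple[float, int]        # (one toy feature, label: 1 = malware, 0 = benign)
Model = Callable[[float], int]


class ErrorRateDriftDetector:
    """Hypothetical detector: flags drift when the error rate over the last
    `window` predictions exceeds `threshold`. Not the detector from the paper."""

    def __init__(self, window: int = 50, threshold: float = 0.3) -> None:
        self.errors: Deque[int] = deque(maxlen=window)
        self.threshold = threshold

    def update(self, wrong: bool) -> bool:
        self.errors.append(int(wrong))
        # Only decide once the window is full, to avoid firing on a few samples.
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold

    def reset(self) -> None:
        self.errors.clear()


def train_centroid_model(data: List[Sample]) -> Model:
    """Toy stand-in for a real malware classifier: nearest class mean."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    means = {c: sums[c] / counts[c] for c in (0, 1) if counts[c]}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def run_stream(
    initial: List[Sample],
    stream: Iterable[Sample],
    window_size: int = 100,
    detector: Optional[ErrorRateDriftDetector] = None,
) -> Tuple[List[int], int]:
    """Predict each incoming sample, then retrain on the most recent
    `window_size` labeled samples only when the detector signals drift.
    Returns the steps at which retraining happened and the total error count."""
    if detector is None:
        detector = ErrorRateDriftDetector()
    recent: Deque[Sample] = deque(initial, maxlen=window_size)
    model = train_centroid_model(list(recent))
    retrain_steps: List[int] = []
    errors = 0
    for step, (x, y) in enumerate(stream):
        wrong = model(x) != y
        errors += int(wrong)
        recent.append((x, y))           # assumes the true label becomes available
        if detector.update(wrong):
            model = train_centroid_model(list(recent))
            detector.reset()
            retrain_steps.append(step)
    return retrain_steps, errors
```

Calling `run_stream(initial, stream)` with a stream whose labeling rule changes partway through should trigger retraining only after the change, whereas a periodic strategy would retrain at fixed intervals regardless of whether the model had degraded. The sketch assumes that true labels arrive immediately after each prediction, which is rarely the case for malware in practice.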