Label-efficient Training Updates for Malware Detection over Time

Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work has explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet existing studies (i) are tightly coupled to specific detector architectures and restricted to a single malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, in isolation and in combination, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving detection performance comparable to retraining with full labels. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, and we show that this stability correlates with detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of detectors that remain effective over time.
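
To make the idea of combining AL and SSL in a training update concrete, the following is a minimal illustrative sketch, not the framework evaluated in this work. It assumes a binary detector that outputs a malware probability per new sample, uses uncertainty sampling as the AL strategy and confidence-thresholded pseudo-labeling as the SSL strategy, and the function name `select_for_update`, the `budget` parameter, and the threshold `tau` are all hypothetical choices made for this example.

```python
def select_for_update(probs, budget, tau=0.95):
    """Split a batch of unlabeled samples into analyst queries and pseudo-labels.

    probs  : list of predicted malware probabilities in [0, 1], one per sample
    budget : number of samples analysts can label in this update (assumed)
    tau    : confidence needed to accept a pseudo-label (assumed value)
    """
    # AL step (uncertainty sampling): the closer a probability is to 0.5,
    # the less certain the detector is, so those samples go to the analysts.
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    query_idx = order[:budget]
    queried = set(query_idx)

    # SSL step (pseudo-labeling): among the samples that were not queried,
    # keep those the detector is confident about and label them with its
    # own prediction (1 = malware, 0 = legitimate).
    pseudo_labels = {}
    for i, p in enumerate(probs):
        if i in queried:
            continue
        if max(p, 1.0 - p) >= tau:
            pseudo_labels[i] = int(p >= 0.5)

    return query_idx, pseudo_labels


# Example: six new samples, with budget for two manual labels.
probs = [0.02, 0.48, 0.99, 0.60, 0.97, 0.10]
query_idx, pseudo_labels = select_for_update(probs, budget=2, tau=0.95)
print(query_idx)      # indices sent to analysts
print(pseudo_labels)  # index -> label assigned by the detector itself
```

In this sketch, only the queried samples incur manual labeling cost; the pseudo-labeled samples extend the retraining set for free, and the remaining samples (neither uncertain enough to query nor confident enough to pseudo-label) are left out of the update.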
