Despite the promising results of machine learning models in malware detection, they suffer from concept drift caused by the constant evolution of malware. Performance declines over time because the data distribution of new files differs from that of the training set, so the model must be updated regularly. In this work, we propose a model-agnostic protocol to improve a baseline neural network's handling of the drift problem. We show the importance of feature reduction and of training with the most recent validation set possible, and we propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement on the classical Binary Cross-Entropy that is more effective against drift. We train our model on the EMBER dataset (2018) and evaluate it on a dataset of recent malicious files collected between 2020 and 2023. Our improved model detects 15.2% more malware than a baseline model.
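For reference, the classical Binary Cross-Entropy that the proposed loss builds on can be sketched as below. This is only the standard baseline: the abstract does not state the form of Drift-Resilient Binary Cross-Entropy, so it is not reproduced here. The function name, the `eps` clipping constant, and the sample labels and scores are illustrative assumptions, not taken from the paper.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean classical binary cross-entropy over a batch.

    y_true: iterable of 0/1 labels (1 = malicious, 0 = benign; assumed convention).
    y_prob: iterable of predicted probabilities for the positive class.
    eps:    clipping constant to avoid log(0) (illustrative choice).
    """
    total = 0.0
    count = 0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that both logarithms stay finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count


# Hypothetical labels and model scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4]
loss = binary_cross_entropy(labels, scores)
```

The loss shrinks as the predicted probabilities move toward the true labels, which is the behaviour any drift-oriented variant would still be expected to preserve on in-distribution data.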