Automated Machine Learning for Deep Learning based Malware Detection

Deep learning (DL) has proven effective at detecting sophisticated, constantly evolving malware. Although DL has alleviated the feature engineering problem, finding the best DL model, both its architecture (the target of neural architecture search, NAS) and its set of hyper-parameters, remains a challenge that requires domain expertise. In addition, many of the proposed state-of-the-art models are very complex and may not be the best fit for every dataset. A promising approach, Automated Machine Learning (AutoML), can reduce the domain expertise required to implement a custom DL model. AutoML reduces the amount of human trial-and-error involved in designing DL models, and more recent implementations can find new model architectures with relatively low computational overhead. This work provides a comprehensive analysis of, and insights into, using AutoML for static and online malware detection. For static detection, our analysis is performed on two widely used malware datasets: SOREL-20M, to demonstrate efficacy on large datasets, and EMBER-2018, a smaller dataset specifically curated to hinder the performance of machine learning models. We also show how tuning the parameters of the NAS process affects the search for a better malware detection model on these static analysis datasets. Further, we demonstrate that AutoML performs well in online malware detection scenarios using Convolutional Neural Networks (CNNs) for cloud IaaS. We compare an AutoML technique to six existing state-of-the-art CNNs using a newly generated online malware dataset, collected with and without other applications running in the background during malware execution. In general, our experimental results show that the performance of AutoML-based static and online malware detection models is on par with or even better than that of state-of-the-art models or hand-designed models presented in the literature.
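
To make the idea of tuning NAS process parameters concrete, the following is a minimal sketch of a random-search loop over a small architecture space, where the trial budget is the process parameter being varied. It is an illustration only and not the AutoML system evaluated in this work: the search space, the function names (`sample_architecture`, `evaluate`, `random_search`), and the `max_trials` argument are all hypothetical, and `evaluate` is a toy stand-in for training a candidate model and measuring its validation accuracy.

```python
import math
import random

# Hypothetical search space; names and ranges are illustrative assumptions,
# not the configuration used in the paper.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "units": [32, 64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}


def sample_architecture(rng):
    """Draw one candidate configuration uniformly from the search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def evaluate(config):
    """Toy stand-in for 'train the candidate and return validation accuracy'.

    A real pipeline would train a model on malware features here. This
    made-up score exists only so the search loop can run end to end.
    """
    score = 0.80
    score += 0.02 * config["num_layers"]
    score += 0.0001 * config["units"]
    score -= 0.05 * abs(config["dropout"] - 0.25)
    score -= 0.01 * abs(-3 - math.log10(config["learning_rate"]))
    return score


def random_search(max_trials, seed=0):
    """Evaluate max_trials random candidates and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        config = sample_architecture(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    # Same seed, larger budget: the first trials are identical, so the best
    # score found can only stay the same or improve.
    for budget in (5, 50):
        config, score = random_search(budget, seed=0)
        print(budget, round(score, 4), config)
```

With a fixed seed, a larger trial budget explores a superset of the candidates seen under a smaller one, so the best score never decreases. Practical AutoML tools typically replace uniform sampling with more guided search strategies, at the cost of a more involved implementation.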
