Automated Machine Learning for Deep Learning based Malware Detection

Deep learning (DL) has proven effective at detecting sophisticated, constantly evolving malware. Although deep learning has alleviated the feature engineering problem, finding the optimal DL model, both its architecture, via neural architecture search (NAS), and its set of hyper-parameters, remains a challenge that requires domain expertise. In addition, many of the proposed state-of-the-art models are very complex and may not be the best fit for different datasets. A promising approach, known as Automated Machine Learning (AutoML), can reduce the domain expertise required to implement a custom DL model. AutoML reduces the amount of human trial-and-error involved in designing DL models, and more recent implementations can find new model architectures with relatively low computational overhead. This work provides a comprehensive analysis of, and insights into, using AutoML for static and online malware detection. For static detection, our analysis is performed on two widely used malware datasets: SOREL-20M, to demonstrate efficacy on large datasets; and EMBER-2018, a smaller dataset specifically curated to hinder the performance of machine learning models. In addition, we show how tuning the NAS process parameters affects the search for a better malware detection model on these static analysis datasets. Further, we demonstrate that AutoML performs well in online malware detection scenarios using Convolutional Neural Networks (CNNs) for cloud IaaS. We compare an AutoML technique to six existing state-of-the-art CNNs using a newly generated online malware dataset, collected both with and without other applications running in the background during malware execution. In general, our experimental results show that the performance of AutoML-based static and online malware detection models is on par with, or even better than, that of state-of-the-art or hand-designed models presented in the literature.
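To make the idea of an automated search over architectures and hyper-parameters concrete, the following is a minimal sketch of random search, one of the simplest search strategies. It is not the AutoML technique evaluated in this work. The search space, the `toy_validation_score` stand-in for training a candidate and measuring its validation accuracy, and the `max_trials` budget are all hypothetical and chosen only so the example runs without a dataset or a DL framework.

```python
import random

# Hypothetical search space: these values are illustrative only and are
# not taken from the paper.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "units": [32, 64, 128, 256],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def sample_config(rng):
    """Draw one candidate architecture/hyper-parameter configuration."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def toy_validation_score(config):
    """Placeholder for 'train the candidate and return validation accuracy'.

    This synthetic score peaks at 2 layers, 128 units, lr=1e-3. It is an
    assumption made so the sketch is runnable, not a real model evaluation.
    """
    penalty = (
        abs(config["num_layers"] - 2) * 0.05
        + abs(config["units"] - 128) / 128 * 0.05
        + (0.0 if config["learning_rate"] == 1e-3 else 0.03)
    )
    return 0.95 - penalty


def random_search(evaluate, max_trials, seed=0):
    """Evaluate max_trials random candidates and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    config, score = random_search(toy_validation_score, max_trials=20)
    print(config, round(score, 3))
```

In this sketch, `max_trials` plays the role of a search-process parameter: a larger budget explores more candidates at a higher computational cost. In a real setting, `evaluate` would train each candidate on the malware dataset and report a held-out metric.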

