In this paper, we propose a framework for early-stage malware detection and mitigation that leverages natural language processing (NLP) techniques and machine learning algorithms. Our primary contribution is an approach for predicting the upcoming actions of malware: we treat application programming interface (API) call sequences as natural language inputs and employ text classification methods, specifically a bidirectional long short-term memory (Bi-LSTM) neural network, to predict the next API call. This enables proactive threat identification and mitigation and demonstrates the effectiveness of applying NLP principles to API call sequences. The Bi-LSTM model is evaluated using two datasets. Additionally, by modeling consecutive API calls as 2-gram and 3-gram strings, we extract new features that are further processed by a Bagging-XGBoost algorithm to predict malware presence at its early stages. The accuracy of the proposed framework is evaluated through simulations.
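To make the 2-gram and 3-gram feature step concrete, the following is a minimal sketch of how consecutive API calls might be joined into n-gram strings and counted. It is an illustration under stated assumptions, not the paper's implementation: the function names `api_ngrams` and `ngram_features`, the underscore separator, the use of raw counts as feature values, and the sample trace are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def api_ngrams(calls: Sequence[str], n: int) -> List[str]:
    """Join each run of n consecutive API calls into one string token."""
    # Assumption: "_" is used as the separator; the paper does not specify one.
    return ["_".join(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def ngram_features(calls: Sequence[str],
                   sizes: Sequence[int] = (2, 3)) -> Dict[str, int]:
    """Count how often each 2-gram and 3-gram string occurs in one trace."""
    counts: Counter = Counter()
    for n in sizes:
        counts.update(api_ngrams(calls, n))
    return dict(counts)


# Hypothetical trace; these API names are illustrative only and are not
# taken from the datasets used in the paper.
trace = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtReadFile", "NtWriteFile"]
features = ngram_features(trace)
```

A count dictionary like `features` could then be vectorized and passed to a downstream classifier; in the framework described above that role is played by a Bagging-XGBoost model, whose configuration is not given here.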