Machine learning (ML) has been widely used to analyze API call sequences in malware analysis, a task that typically requires domain specialists to extract relevant features from raw data. The extracted features play a critical role in malware analysis. Traditional feature extraction relies on human domain knowledge, but there is a growing trend toward using natural language processing (NLP) for automatic feature extraction. This raises a question: how do we effectively select features for malware analysis based on API call sequences? To answer it, this paper presents a comprehensive study of the impact of feature engineering on malware classification. We first conducted a comparative performance evaluation of knowledge-based and NLP-based feature engineering methods under three models: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer. We observed that models with knowledge-based feature engineering inputs generally outperform those with NLP-based inputs across all metrics, especially under smaller sample sizes. We then analyzed a complete set of data features from API call sequences. Our analysis reveals that models often focus on features such as handles and virtual addresses, which vary across executions and are difficult for human analysts to interpret.
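To illustrate why handles and virtual addresses vary across executions, the following is a minimal sketch, not the paper's pipeline. It assumes a hypothetical trace format of `(api_name, arguments)` pairs; the argument names, the `VOLATILE_ARGS` set, and both tokenizer functions are illustrative assumptions. It contrasts a raw tokenization, which keeps every argument value, with a simple knowledge-based step that masks run-specific values.

```python
# Hypothetical trace format: a list of (api_name, {argument_name: value}) pairs.
# The argument names below are illustrative, not taken from any specific sandbox.

# Assumed domain knowledge: these arguments hold values that change on every run.
VOLATILE_ARGS = {"file_handle", "base_address"}


def raw_tokens(trace):
    """Tokenize every API name and argument value as-is."""
    tokens = []
    for api, args in trace:
        tokens.append(api)
        for name, value in sorted(args.items()):
            tokens.append(f"{name}={value}")
    return tokens


def knowledge_based_tokens(trace):
    """Keep API names and stable arguments; mask run-specific values."""
    tokens = []
    for api, args in trace:
        tokens.append(api)
        for name, value in sorted(args.items()):
            if name in VOLATILE_ARGS:
                tokens.append(f"{name}=<VOLATILE>")
            else:
                tokens.append(f"{name}={value}")
    return tokens


# Two made-up runs of the same behavior: only the handle and address differ.
run_a = [
    ("NtOpenFile", {"file_handle": "0x000001a4",
                    "filepath": r"C:\Windows\System32\kernel32.dll"}),
    ("NtAllocateVirtualMemory", {"base_address": "0x7ff61a2b0000",
                                 "region_size": 4096}),
]
run_b = [
    ("NtOpenFile", {"file_handle": "0x00000238",
                    "filepath": r"C:\Windows\System32\kernel32.dll"}),
    ("NtAllocateVirtualMemory", {"base_address": "0x7ff6c4510000",
                                 "region_size": 4096}),
]

# Raw token sequences differ between runs; masked sequences are identical.
same_raw = raw_tokens(run_a) == raw_tokens(run_b)
same_masked = knowledge_based_tokens(run_a) == knowledge_based_tokens(run_b)
```

In this toy setting, a model trained on the raw tokens could latch onto values that carry no behavioral meaning, whereas the masked tokens expose only the stable parts of each call.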