This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns across malware families and benign software, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. Evaluated on real-world malware samples, the proposed method shows significantly improved accuracy compared to traditional methods. Using a hybrid feature selection technique to address high dimensionality, our n-gram approach achieved an accuracy of 99.02% across various machine learning algorithms. The hybrid feature selection technique reduces the feature set to only 1.6% of the original features.
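As an illustration of what an n-gram over an API call sequence looks like, the following is a minimal sketch and not the paper's implementation. The function names `extract_ngrams` and `ngram_counts`, the example trace, and the choice of n = 2 are all assumptions made for illustration. The hybrid feature selection step is not shown, since the abstract does not specify it.

```python
from collections import Counter
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count how often each n-gram occurs; the counts can serve as raw features."""
    return Counter(extract_ngrams(tokens, n))


# Hypothetical API call trace (illustrative only, not taken from the paper's data).
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = ngram_counts(trace, 2)

for gram, count in bigrams.most_common():
    print(" -> ".join(gram), count)
```

Each distinct n-gram becomes one feature dimension, which is why the feature space grows quickly with n and with the number of distinct calls, and why some form of feature selection is needed before classification.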