Machine learning (ML)-based malware detection systems are becoming increasingly important as malware threats grow in number and sophistication. PDF files are often used as vectors for phishing attacks because they are widely regarded as trustworthy data resources and are accessible across different platforms. Researchers have therefore developed many different PDF malware detection methods, and the performance of these methods is strongly influenced by feature selection. In this research, we propose a small feature set that does not require extensive domain knowledge of PDF files. We evaluate the proposed features with six different machine learning models and report a best accuracy of 99.75% with the Random Forest model. Our proposed feature set, which consists of just 12 features, is one of the most concise in the field of PDF malware detection. Despite its modest size, we obtain results comparable to those of state-of-the-art methods that employ much larger feature sets.