Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

Recently, spam on online social networks has attracted attention in both research and business. Twitter has become the preferred medium for spreading spam content. Many research efforts have attempted to counter social network spam. Twitter poses additional challenges in the form of a large feature space and imbalanced data distributions. Related research usually addresses only some of these challenges or produces black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyperparameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting (XGBoost) classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using 50 repetitions of 10-fold stratified cross-validation and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32\% geometric mean and 92.67\% accuracy while using less than 10\% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the $\chi^2$ and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared with related works.
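The abstract reports both geometric mean and accuracy because accuracy alone can mislead on imbalanced data. The sketch below illustrates the difference, assuming the usual definition of geometric mean in imbalanced classification, the square root of sensitivity times specificity; the abstract does not spell out the formula, and the function names and toy labels are hypothetical.

```python
import math


def geometric_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(sensitivity * specificity) for binary labels (assumed definition)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced data (hypothetical): 10 spam (1) and 90 non-spam (0) messages.
y_true = [1] * 10 + [0] * 90

# A classifier that never predicts spam looks accurate but has zero G-mean.
always_ham = [0] * 100

# A classifier that catches half of the spam and makes no false alarms.
half_spam = [1] * 5 + [0] * 95
```

On this toy data, the always-negative classifier reaches 90% accuracy with a geometric mean of zero, which is why a metric that accounts for both classes is useful when spam is the minority class.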



Feature Selection


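Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from the M available features so that a specific system metric is optimized. It is the process of picking the most effective of the original features to reduce the dimensionality of the dataset. It is an important means of improving the performance of a learning algorithm and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.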
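Choosing N of M features is commonly represented as a Boolean mask of length M, which is also a natural chromosome encoding for a genetic algorithm. The following is a minimal sketch under that assumption; the helper names `apply_mask` and `mutate_mask` are hypothetical, and the bit-flip mutation shown is the textbook operator, not necessarily the modified operator used in the paper.

```python
import random


def apply_mask(rows, mask):
    """Keep only the columns whose mask entry is True."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


def mutate_mask(mask, rate, rng):
    """Flip each mask bit independently with probability `rate` (bit-flip mutation)."""
    return [(not bit) if rng.random() < rate else bit for bit in mask]


# Toy dataset (hypothetical): 2 samples, M = 3 features.
rows = [[1, 2, 3], [4, 5, 6]]

# Select N = 2 features: the first and the third.
mask = [True, False, True]
reduced = apply_mask(rows, mask)

# Produce a candidate neighbor of the current feature subset.
rng = random.Random(0)
child = mutate_mask(mask, rate=0.1, rng=rng)
```

A fitness function would then train a model on `apply_mask(rows, child)` and score it with the chosen metric; that step is omitted here.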
