Spam on online social networks has recently attracted attention in both research and industry, and Twitter has become the preferred medium for spreading spam content. Many research efforts have attempted to counter social network spam. Twitter poses additional challenges: a large feature space and imbalanced data distributions. Related work usually addresses only some of these challenges or produces black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyperparameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using 10-fold stratified cross-validation repeated 50 times and analyzed using nonparametric statistical tests. The resulting prediction model attains, on average, a geometric mean of 82.32\% and an accuracy of 92.67\% while using less than 10\% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the $\chi^2$ and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared with related work.
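To make the idea of searching over feature subsets and hyperparameters at the same time more concrete, the following is a minimal sketch of a plain genetic algorithm whose chromosome holds both a feature mask and two hyperparameters. It is not the modified algorithm proposed in the paper. The names `max_depth` and `learning_rate`, the set `INFORMATIVE`, and the function `toy_fitness` are illustrative assumptions; in a real pipeline the fitness would presumably be a cross-validated score of the classifier trained on the selected features.

```python
import random

N_FEATURES = 20
INFORMATIVE = {0, 3, 7}       # hypothetical "useful" features for the toy fitness
DEPTH_RANGE = (2, 10)         # assumed search range for a tree-depth parameter
RATE_RANGE = (0.01, 0.5)      # assumed search range for a learning rate


def random_individual(rng):
    """One chromosome: a boolean feature mask plus two hyperparameters."""
    return {
        "mask": [rng.random() < 0.5 for _ in range(N_FEATURES)],
        "max_depth": rng.randint(*DEPTH_RANGE),
        "learning_rate": rng.uniform(*RATE_RANGE),
    }


def toy_fitness(ind):
    """Stand-in for a cross-validated score (assumption, not the paper's fitness).

    Rewards selecting the informative features, penalizes extra features and
    hyperparameters far from an arbitrary optimum (depth 6, rate 0.1).
    """
    selected = {i for i, bit in enumerate(ind["mask"]) if bit}
    if not selected:
        return 0.0
    hits = len(selected & INFORMATIVE) / len(INFORMATIVE)
    size_penalty = 0.02 * len(selected - INFORMATIVE)
    depth_penalty = 0.01 * abs(ind["max_depth"] - 6)
    rate_penalty = abs(ind["learning_rate"] - 0.1)
    return hits - size_penalty - depth_penalty - rate_penalty


def tournament(pop, scores, rng, k=3):
    """Pick the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: scores[i])]


def crossover(a, b, rng):
    """Uniform crossover over mask bits and over each hyperparameter."""
    return {
        "mask": [rng.choice(pair) for pair in zip(a["mask"], b["mask"])],
        "max_depth": rng.choice((a["max_depth"], b["max_depth"])),
        "learning_rate": rng.choice((a["learning_rate"], b["learning_rate"])),
    }


def mutate(ind, rng, p=0.05):
    """Flip mask bits and resample hyperparameters with probability p each."""
    ind["mask"] = [(not bit) if rng.random() < p else bit for bit in ind["mask"]]
    if rng.random() < p:
        ind["max_depth"] = rng.randint(*DEPTH_RANGE)
    if rng.random() < p:
        ind["learning_rate"] = rng.uniform(*RATE_RANGE)
    return ind


def evolve(pop_size=30, generations=40, seed=0):
    """Run the GA and return the best individual and the best score per generation."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [toy_fitness(ind) for ind in pop]
        best_idx = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best_idx])
        # Elitism: carry an unmodified copy of the best individual forward.
        elite = dict(pop[best_idx], mask=list(pop[best_idx]["mask"]))
        next_pop = [elite]
        while len(next_pop) < pop_size:
            child = crossover(tournament(pop, scores, rng),
                              tournament(pop, scores, rng), rng)
            next_pop.append(mutate(child, rng))
        pop = next_pop
    scores = [toy_fitness(ind) for ind in pop]
    best_idx = max(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best_idx])
    return pop[best_idx], history


if __name__ == "__main__":
    best, history = evolve()
    print("selected features:", [i for i, bit in enumerate(best["mask"]) if bit])
    print("max_depth:", best["max_depth"], "learning_rate:", round(best["learning_rate"], 3))
```

Encoding the mask and the hyperparameters in one chromosome lets a single fitness evaluation score a feature subset together with the settings it is trained under, rather than tuning the two in separate passes. Elitism keeps the best score from decreasing between generations.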
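The abstract reports both geometric mean and accuracy, which can diverge sharply on imbalanced data. The sketch below assumes the common definition of the geometric mean as the square root of sensitivity times specificity; the abstract does not state the formula, so this is an assumption. The labels and predictions are made-up toy data, not results from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


def geometric_mean(y_true, y_pred, positive=1):
    """sqrt(sensitivity * specificity); assumed definition of G-mean."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced data: 10 spam (1) and 90 legitimate (0) messages.
y_true = [1] * 10 + [0] * 90

# A degenerate classifier that never predicts spam.
always_ham = [0] * 100

# A classifier that catches 8 of 10 spam and raises 9 false alarms.
balanced = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

if __name__ == "__main__":
    for name, y_pred in (("always_ham", always_ham), ("balanced", balanced)):
        print(name, accuracy(y_true, y_pred), geometric_mean(y_true, y_pred))
```

On this toy data the classifier that never flags spam has the higher accuracy but a geometric mean of zero, which is why a metric of this kind is a natural complement to accuracy when the spam class is a minority.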