BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline

Twitter, one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, which spread false information and manipulate discussion. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Using this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP uses nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each using a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, shows that the proposed model outperforms existing solutions by almost 10% in terms of F1-score, achieving average scores of 83.19 and 68.5 for the specific and general approaches, respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via the Twitter API over a 16-month period during the 2022 Russo-Ukrainian War. This dataset was created in collaboration with [Shevtsov et al., 2022a], whose authors share anonymized tweets discussing the Russo-Ukrainian War, 127,275,386 tweets in total. Combining the existing text dataset with the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
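The comparison above is reported as an F1-score averaged over datasets. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: it assumes F1 is computed for the bot class (label 1) and that per-dataset scores are averaged with equal weight, neither of which the abstract states. The function names and the toy labels are hypothetical.

```python
from typing import Dict, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class (assumed here: bot = 1, human = 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bots correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # humans flagged as bots
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bots that were missed
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(results: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-dataset F1 (an assumed averaging scheme)."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in results.values()]
    return sum(scores) / len(scores)


# Toy labels for two hypothetical datasets; these are not results from the paper.
toy_results = {
    "dataset_a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "dataset_b": ([1, 0, 1, 0], [1, 0, 1, 1]),
}

if __name__ == "__main__":
    print(f"average F1: {average_f1(toy_results):.4f}")
```

Under this equal-weight scheme, a small dataset influences the reported average as much as a large one, which is worth keeping in mind when comparing averaged scores across methods.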

