Twitter, one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, which spread false information and manipulate discussion. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Using this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across the nine public datasets under standardized conditions, shows that the proposed model outperforms existing solutions by almost 10% in F1-score, achieving average scores of 83.19 and 68.5 for the specific and general approaches, respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via the Twitter API during the 2022 Russo-Ukrainian War over a 16-month period. This dataset was created in collaboration with [Shevtsov et al., 2022a], whose authors share anonymized tweets from the discussion of the Russo-Ukrainian War, totaling 127,275,386 tweets. Combining the existing text dataset with the provided labeled bot and human profiles will support the future development of more advanced large language models for bot detection in the post-Twitter API era.