Twitter, one of the most popular social networks, offers a means of communication and online discourse, but it has unfortunately been targeted by bots and fake accounts that manipulate discussion and spread false information. To address this, we gather a challenging, multilingual dataset of social discourse on Twitter, originating from 9M users and concerning the recent Russo-Ukrainian war, in order to detect bot accounts and the conversations involving them. We collect the ground truth for our dataset through the Twitter API suspended accounts collection; it contains approximately 343K bot accounts and 8M normal users. Additionally, we use a dataset provided by Botometer-V3 with 1,777 Varol accounts, 483 German accounts, and 1,321 US accounts. Besides the publicly available datasets, we also collect two independent datasets around popular discussion topics: the 2022 energy crisis and the 2022 conspiracy discussions. Both datasets were labeled according to the Twitter suspension mechanism. We build a novel ML model for bot detection based on the state-of-the-art XGBoost model. We combine the model with a high volume of tweets labeled according to the Twitter suspension mechanism ground truth. The approach requires only a limited set of profile features, which allows the dataset to be labeled at time periods different from the collection period, since labeling is independent of the Twitter API. In comparison with Botometer, our methodology achieves an average 11% higher ROC-AUC score over two real-case scenario datasets.
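For readers unfamiliar with the ROC-AUC score used in the comparison with Botometer, the following is a minimal sketch of how the metric can be computed from bot-likelihood scores and suspension-based labels. It is not the authors' evaluation code: the function name `roc_auc`, the variable names, and the toy labels and scores are all hypothetical, chosen only for illustration. The sketch relies on the standard pairwise interpretation of ROC-AUC, namely the probability that a randomly chosen bot receives a higher score than a randomly chosen normal user, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Compute ROC-AUC from binary labels (1 = bot, 0 = normal) and scores.

    Uses the pairwise (Mann-Whitney) formulation: the fraction of
    (positive, negative) pairs in which the positive is scored higher,
    counting ties as half a win. Quadratic in the number of accounts,
    so suitable for illustration only.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 marks an account labeled as a bot (e.g. suspended),
# 0 marks a normal user; scores are a detector's bot-likelihood outputs.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.35, 0.1]

if __name__ == "__main__":
    print("ROC-AUC on toy data:", roc_auc(toy_labels, toy_scores))
```

A score of 1.0 means every bot is ranked above every normal user, while 0.5 corresponds to a ranking no better than chance. Because the metric depends only on the ranking of scores, it allows detectors with differently calibrated outputs to be compared on the same labeled accounts.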