BotArtist: Twitter bot detection Machine Learning model based on Twitter suspension

Twitter, one of the most popular social networks, offers a means for communication and online discourse, but it has also become a target of bots and fake accounts that manipulate discussion and spread false information. To address this, we gather a challenging, multilingual dataset of social discourse on Twitter about the recent Russo-Ukrainian war, originating from 9M users, in order to detect bot accounts and the conversations involving them. We derive the ground truth for this dataset from suspended accounts collected through the Twitter API: approximately 343K bot accounts and 8M normal users. Additionally, we use a dataset provided by Botometer-V3 with 1,777 Varol accounts, 483 German accounts, and 1,321 US accounts. Beyond these publicly available datasets, we also collect two independent datasets around popular discussion topics: the 2022 energy crisis and the 2022 conspiracy discussions. Both datasets are labeled according to the Twitter suspension mechanism. We build a novel ML model for bot detection using the state-of-the-art XGBoost model. We combine the model with a high volume of tweets labeled according to the Twitter suspension mechanism ground truth. The model requires only a limited set of profile features, which allows a dataset to be labeled at a time different from its collection, because the labeling does not depend on the Twitter API. Compared with Botometer, our methodology achieves an ROC-AUC score that is on average 11% higher over two real-case scenario datasets.
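
Since the comparison with Botometer is reported in terms of ROC-AUC, a minimal sketch of how that metric can be computed may help the reader. It uses the pairwise definition: the probability that a randomly chosen bot account receives a higher score than a randomly chosen normal account. The function name `roc_auc`, the toy labels, and both score lists are hypothetical illustrations and are not taken from the paper's data or code.

```python
def roc_auc(labels, scores):
    """ROC-AUC via the pairwise (Mann-Whitney) definition.

    labels: sequence of 0/1, where 1 marks a suspended (bot) account
    scores: sequence of classifier scores, higher meaning more bot-like
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (positive, negative) pairs ranked correctly.
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six accounts scored by two imaginary classifiers.
labels = [1, 1, 0, 0, 0, 1]
scores_a = [0.9, 0.4, 0.35, 0.1, 0.5, 0.8]
scores_b = [0.6, 0.3, 0.4, 0.2, 0.7, 0.5]

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
print(f"classifier A: {auc_a:.3f}, classifier B: {auc_b:.3f}")
```

The double loop is quadratic in the number of accounts, so it is only suitable for small illustrations. On datasets of the size described here, a sort-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.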
