Rapid and accurate detection of bot accounts, to keep them from intruding on and harassing genuine users, is a persistently popular topic in online social networks. We propose BotTriNet, a unified embedding framework that detects bots from the textual content accounts post, on the assumption that content naturally reveals an account's personality and habits. Such content is abundant, and it is valuable when the system can efficiently extract bot-related information from it using embedding techniques. Beyond the general embedding framework, which generates word, sentence, and account embeddings, we design a triplet network that tunes the raw embeddings (produced by traditional natural language processing techniques) for better classification performance. We evaluate detection accuracy and F1 score on the real-world dataset CRESCI2017, which comprises three bot account categories and five bot sample sets. Our system achieves the highest average accuracy of 98.34% and F1 score of 97.99% on two content-intensive bot sets, outperforming previous work and setting a new state of the art. It also achieves a breakthrough on four content-less bot sets, with an average accuracy improvement of 11.52% and an average F1 score increase of 16.70%.
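The embedding hierarchy and the triplet objective described in the abstract can be illustrated with a minimal sketch. It rests on assumptions not stated above: word vectors are mean-pooled into sentence vectors, sentence vectors are mean-pooled into an account vector, and tuning uses the standard triplet margin loss with Euclidean distance. The function names (`mean_pool`, `account_embedding`, `triplet_loss`) and the `margin` value are hypothetical and are not taken from BotTriNet itself.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def account_embedding(posts):
    """Build one account vector from its posts.

    `posts` is a list of posts; each post is a list of word vectors.
    Assumption: mean pooling at both the sentence and the account level.
    """
    sentence_vectors = [mean_pool(word_vectors) for word_vectors in posts]
    return mean_pool(sentence_vectors)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, hypothetical margin).

    The loss is zero once the anchor is closer to the positive (same
    class) than to the negative (other class) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In this sketch, a training step would pick an anchor account, a positive account with the same label (bot or genuine), and a negative account with the other label, then adjust the embedding network to reduce `triplet_loss`. The intended effect is that accounts of the same class cluster together before a classifier is applied.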