As malicious actors deploy increasingly advanced and widespread bots to disseminate misinformation and manipulate public opinion, detecting Twitter bots has become a crucial task. Although graph-based Twitter bot detection methods achieve state-of-the-art performance, we find that their inference depends on neighbor users multiple hops away from the target users, and that fetching these neighbors is time-consuming and may introduce bias. At the same time, we find that pretrained language models, once fine-tuned on Twitter bot detection, achieve competitive performance without requiring a graph structure at deployment. Inspired by this finding, we propose LMBot, a novel bot detection framework that distills the knowledge of graph neural networks (GNNs) into language models (LMs) for graph-less deployment, addressing the challenge of data dependency. Moreover, LMBot is compatible with both graph-based and graph-less datasets. Specifically, we first represent each user as a textual sequence and feed it into the LM for domain adaptation. For graph-based datasets, the output of the LM provides input features for the GNN, enabling the GNN to optimize for bot detection and distill knowledge back to the LM in an iterative, mutually enhancing process. Armed with the LM, we can perform graph-less inference, which resolves the graph data dependency and sampling bias issues. For datasets without graph structure, we simply replace the GNN with an MLP, which also shows strong performance. Our experiments demonstrate that LMBot achieves state-of-the-art performance on four Twitter bot detection benchmarks. Extensive studies also show that LMBot is more robust, versatile, and efficient than graph-based Twitter bot detection methods.
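To make the two central steps of the abstract more concrete (representing a user as a textual sequence, and distilling a graph-based teacher into a language-model student), the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the serialization format or the distillation objective, so everything here is an assumption: the user field names, the helper names `user_to_text` and `distillation_loss`, the weight `alpha`, the `temperature`, and the choice of a cross-entropy term plus a temperature-scaled KL term, which is simply one common form of knowledge distillation.

```python
import math


def user_to_text(user: dict) -> str:
    """Flatten a user record into one textual sequence.

    The field names below are hypothetical; the abstract only says that
    each user is represented as a textual sequence.
    """
    parts = [
        f"name: {user.get('name', '')}",
        f"description: {user.get('description', '')}",
        f"followers: {user.get('followers', 0)}",
        f"tweets: {' | '.join(user.get('tweets', []))}",
    ]
    return " ; ".join(parts)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Assumed objective: alpha * CE(student, label) + (1 - alpha) * KL(teacher || student).

    The student stands in for the LM and the teacher for the GNN (or MLP);
    this generic soft-label loss is an illustration, not the paper's formula.
    """
    # Hard-label cross-entropy on the student's own prediction.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: KL divergence from the teacher to the student,
    # both softened by the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_soft) if t > 0)

    return alpha * ce + (1 - alpha) * kl


if __name__ == "__main__":
    example = {"name": "alice", "description": "cat lover",
               "followers": 10, "tweets": ["hi", "bye"]}
    print(user_to_text(example))
    # Class 0 = human, class 1 = bot (an arbitrary convention for this sketch).
    print(distillation_loss([2.0, 0.0], [0.0, 2.0], label=0))
```

In a graph-less deployment of the kind the abstract describes, only the student side of this sketch would be needed at inference time: the serialized text goes to the LM, and no neighbor lookup is required.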