As malicious actors deploy increasingly advanced and widespread bots to disseminate misinformation and manipulate public opinion, detecting Twitter bots has become a crucial task. Although graph-based Twitter bot detection methods achieve state-of-the-art performance, we find that their inference depends on neighboring users multiple hops away from the target user, and that fetching these neighbors is time-consuming and may introduce sampling bias. At the same time, we find that pretrained language models, once finetuned on Twitter bot detection, achieve competitive performance and do not require a graph structure during deployment. Inspired by this finding, we propose LMBot, a novel bot detection framework that distills the knowledge of graph neural networks (GNNs) into language models (LMs) for graph-less deployment, addressing the challenge of data dependency. LMBot is compatible with both graph-based and graph-less datasets. Specifically, we first represent each user as a textual sequence and feed these sequences into the LM for domain adaptation. For graph-based datasets, the output of the LM provides input features for the GNN, which is optimized for bot detection and distills its knowledge back into the LM in an iterative, mutually enhancing process. With the resulting LM, we can perform graph-less inference, which resolves the graph data dependency and sampling bias issues. For datasets without graph structure, we simply replace the GNN with an MLP, which also performs strongly. Our experiments demonstrate that LMBot achieves state-of-the-art performance on four Twitter bot detection benchmarks. Extensive studies also show that LMBot is more robust, versatile, and efficient than graph-based Twitter bot detection methods.
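The two mechanisms the abstract names, serializing a user into a textual sequence and distilling a teacher's predictions into the LM, can be illustrated with a minimal sketch. This is not the authors' implementation: the field names in `user_to_text`, the weighting `alpha`, the `temperature`, and the choice of a cross-entropy term plus a KL-divergence term are all assumptions made for illustration, and plain Python lists stand in for real model logits.

```python
import math


def user_to_text(user):
    """Flatten a user record into one textual sequence for the LM.

    The field names here are hypothetical, not taken from the paper.
    """
    parts = [
        f"name: {user.get('name', '')}",
        f"description: {user.get('description', '')}",
        f"followers: {user.get('followers', 0)}",
    ]
    parts.extend(f"tweet: {t}" for t in user.get("tweets", []))
    return " | ".join(parts)


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a hard-label loss and a soft-label distillation loss.

    student_logits: LM outputs for one user, e.g. [human, bot].
    teacher_logits: GNN (or MLP) outputs for the same user.
    alpha and temperature are assumed hyperparameters.
    """
    # Hard-label term: cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[label] + 1e-12)
    # Soft-label term: KL divergence from the teacher's softened
    # distribution to the student's softened distribution.
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard + (1 - alpha) * soft


example_user = {
    "name": "news_fan",
    "description": "daily headlines",
    "followers": 12,
    "tweets": ["Breaking news!", "Follow for more"],
}
example_text = user_to_text(example_user)
example_loss = distillation_loss([0.2, 1.1], [-0.5, 2.0], label=1)
```

Under these assumptions, only the student LM and `user_to_text` would be needed at inference time; the teacher logits appear solely in the training loss, which is what makes the deployment graph-less.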