Multimodal Detection of Social Spambots in Twitter using Transformers

Although not all bots are malicious, the vast majority of them are responsible for spreading misinformation and manipulating public opinion on issues such as elections. Early detection of social spambots is therefore crucial. Methods for detecting bots in social media have been proposed, but they still have substantial limitations. For instance, existing approaches either extract a large number of features and train traditional machine learning algorithms, or use GloVe embeddings and train LSTMs. Feature extraction, however, is a tedious procedure that demands domain expertise, and transformer-based language models have been shown to outperform LSTMs. Other approaches build large graphs and train graph neural networks, which requires many hours of training and access to substantial computational resources. To tackle these limitations, this is the first study to employ only the user description field and three-channel images that encode the type and content of the tweets posted by users. First, we create digital DNA sequences, transform them into 3D images, and apply pretrained models from the vision domain, including EfficientNet, AlexNet, and VGG16, among others. Next, we propose a multimodal approach in which TwHIN-BERT provides the textual representation of the user description field and VGG16 provides the visual representation of the image modality. We propose three methods for fusing the modalities, namely concatenation, a gated multimodal unit, and crossmodal attention, and compare their performance. Extensive experiments on the Cresci '17 dataset show clear advantages of our approaches over state-of-the-art ones, with accuracy reaching up to 99.98%.
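To make two of the fusion methods named above more concrete, the following is a minimal NumPy sketch of concatenation and a gated multimodal unit, not the authors' implementation. It assumes the text and image encoders have already produced fixed-size vectors; the dimensions (768 for text, 512 for image, 128 for the fused representation), the random weights, and the names `concat_fusion` and `GatedMultimodalUnit` are hypothetical and chosen only for illustration. The gate follows the standard gated multimodal unit formulation, in which a sigmoid gate computed from both inputs mixes the two tanh-transformed modalities.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def concat_fusion(text_vec, image_vec):
    """Fuse by concatenating the two modality vectors along the feature axis."""
    return np.concatenate([text_vec, image_vec], axis=-1)


class GatedMultimodalUnit:
    """Illustrative gated multimodal unit with random, untrained weights."""

    def __init__(self, text_dim, image_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical weights; in a real model these would be learned.
        self.W_t = rng.normal(0.0, 0.1, (text_dim, hidden_dim))
        self.W_v = rng.normal(0.0, 0.1, (image_dim, hidden_dim))
        self.W_z = rng.normal(0.0, 0.1, (text_dim + image_dim, hidden_dim))

    def __call__(self, text_vec, image_vec):
        h_t = np.tanh(text_vec @ self.W_t)    # projected text representation
        h_v = np.tanh(image_vec @ self.W_v)   # projected image representation
        joint = np.concatenate([text_vec, image_vec], axis=-1)
        z = sigmoid(joint @ self.W_z)         # gate in (0, 1), one value per hidden unit
        return z * h_t + (1.0 - z) * h_v      # gated mix of the two modalities


# Stand-in features for a batch of 4 users (assumed sizes, random values).
rng = np.random.default_rng(1)
text_vec = rng.normal(size=(4, 768))
image_vec = rng.normal(size=(4, 512))

fused_concat = concat_fusion(text_vec, image_vec)
gmu = GatedMultimodalUnit(text_dim=768, image_dim=512, hidden_dim=128)
fused_gmu = gmu(text_vec, image_vec)
```

Concatenation keeps every feature from both modalities and leaves the weighting to the downstream classifier, whereas the gated unit produces a fixed-size representation in which each hidden unit decides how much to rely on text versus image.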
