The ease of sharing multimedia content on social media has accelerated the dissemination of fake news, which threatens social stability and security. Fake news detection has therefore attracted extensive research interest in the field of social forensics. Current methods concentrate primarily on integrating textual and visual features, but they fail to exploit multi-modal information effectively at both the fine-grained and coarse-grained levels. They also suffer from an ambiguity problem, caused either by a lack of correlation between modalities or by contradictory decisions made by the individual modalities. To overcome these challenges, we present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection. Inspired by the multi-grained process by which humans assess news authenticity, we employ two Transformer-based pre-trained models to encode token-level features from text and images, respectively. The multi-modal module fuses these fine-grained features while taking into account coarse-grained features encoded by the CLIP encoder. To address the ambiguity problem, we design uni-modal branches with similarity-based weighting that adaptively adjust the use of multi-modal features. Experimental results demonstrate that the proposed framework outperforms state-of-the-art methods on three widely used datasets.
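The similarity-based weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `cosine_similarity` and `similarity_weighted_fusion` are hypothetical, the weight is taken to be the cosine similarity between the coarse-grained (CLIP-style) text and image embeddings rescaled to [0, 1], and the final representation is assumed to be a simple concatenation of the two uni-modal features with the weighted multi-modal feature.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two equal-length vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def similarity_weighted_fusion(
    text_feat: Sequence[float],
    image_feat: Sequence[float],
    fused_feat: Sequence[float],
    clip_text: Sequence[float],
    clip_image: Sequence[float],
) -> Tuple[List[float], float]:
    """Hypothetical fusion step, not the paper's exact method.

    The multi-modal feature is scaled by a weight derived from the
    similarity of the coarse-grained embeddings, then concatenated
    with the uni-modal features.
    """
    sim = cosine_similarity(clip_text, clip_image)
    # Assumption: rescale similarity from [-1, 1] to a weight in [0, 1].
    weight = (sim + 1.0) / 2.0
    weighted_fused = [weight * v for v in fused_feat]
    # Assumption: uni-modal branches pass through unchanged.
    return list(text_feat) + list(image_feat) + weighted_fused, weight
```

Under these assumptions, a post whose text and image embeddings agree yields a weight close to 1, so the classifier sees the fused feature at nearly full strength. When the embeddings point in opposite directions the weight approaches 0, and the decision rests mainly on the uni-modal branches.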