Multimodal content, such as text mixed with images, poses significant challenges for rumor detection on social media. Existing multimodal rumor detection methods focus either on mixing tokens across spatial and sequential locations to build unimodal representations or on fusing clues of rumor veracity across modalities. However, these methods yield less discriminative unimodal representations and are vulnerable to intricate location dependencies in the time-consuming fusion of spatial and sequential tokens. This work makes the first attempt at multimodal rumor detection in the frequency domain: it efficiently transforms spatial features into the frequency spectrum and obtains highly discriminative spectrum features for multimodal representation and fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning shows that the frequency spectrum is more effective for multimodal representation and fusion, extracting the informative components for rumor detection. FSRU involves three novel mechanisms: a Fourier transform that converts features from the spatial domain to the frequency domain, unimodal spectrum compression, and a cross-modal spectrum co-selection module in the frequency domain. Extensive experiments show that FSRU achieves satisfactory multimodal rumor detection performance.
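To make the three mechanisms more concrete, the following is a minimal sketch of the general idea and not the FSRU implementation. It uses a naive discrete Fourier transform from the Python standard library in place of a fast Fourier transform. The functions `compress_spectrum` and `co_select` are hypothetical stand-ins: the first assumes that compression means keeping the largest-magnitude frequency components, and the second assumes that co-selection means gating one modality's spectrum by the relative magnitude of the other's. The abstract does not specify either rule, and the toy feature vectors are invented for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft: maps a spectrum back to the original domain."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def compress_spectrum(spectrum, keep):
    """Hypothetical compression rule (an assumption, not from the paper):
    zero out all but the `keep` largest-magnitude frequency components."""
    ranked = sorted(range(len(spectrum)), key=lambda k: abs(spectrum[k]), reverse=True)
    kept = set(ranked[:keep])
    return [c if k in kept else 0j for k, c in enumerate(spectrum)]


def co_select(spec_a, spec_b):
    """Hypothetical co-selection rule (an assumption, not from the paper):
    scale each component of spec_a by the relative magnitude of spec_b
    at the same frequency index."""
    peak = max(abs(c) for c in spec_b) or 1.0  # avoid dividing by zero
    return [a * (abs(b) / peak) for a, b in zip(spec_a, spec_b)]


if __name__ == "__main__":
    # Toy one-dimensional "features" for two modalities (illustrative values).
    text_feat = [1.0, 0.0, -1.0, 0.0]
    image_feat = [0.0, 1.0, 0.0, -1.0]

    text_spec = compress_spectrum(dft(text_feat), keep=2)
    image_spec = compress_spectrum(dft(image_feat), keep=2)

    # Gate the text spectrum by the image spectrum, then return to the
    # original domain to inspect the fused feature.
    fused = idft(co_select(text_spec, image_spec))
    print([round(v.real, 6) for v in fused])
```

In practice the transform would be computed with an FFT routine from a numerical library, and the compression and selection steps would presumably be learned rather than fixed rules as they are here.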