MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection

The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content in isolation but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that addresses modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping time windows, enabling the framework to capture both short-term fluctuations and longer-term trends in misinformation propagation. In addition, domain-adversarial learning and a prototype memory bank improve domain invariance and stabilize representation learning across datasets. The model is trained with a multi-objective optimization strategy that jointly optimizes objectives for classification, cross-modal alignment, contrastive learning, temporal consistency, and domain robustness. Experiments on Fakeddit, MMCoVaR, Weibo, and XFacta show that MOMENTA achieves consistently strong results in accuracy, F1-score, AUC, and MCC, highlighting its effectiveness for multimodal misinformation detection.
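To make the temporal component more concrete, the following is a minimal sketch of one way attention-based aggregation with drift and momentum encoding over overlapping windows could be realized. The abstract gives no formulas, so everything here is an assumption for illustration: drift is taken as the difference between consecutive window means, momentum as an exponential moving average of drift with coefficient `beta`, and attention as scaled dot-product pooling against a single query vector (fixed here, learned in a real model). The function names, window size, and stride are hypothetical and are not taken from MOMENTA itself.

```python
import math
from typing import List, Tuple

Vector = List[float]


def overlapping_windows(items: List[Vector], size: int, stride: int) -> List[List[Vector]]:
    """Split a time-ordered sequence into windows; stride < size gives overlap.

    Trailing items that do not fill a complete window are dropped.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not items:
        return []
    if len(items) <= size:
        return [items]
    return [items[i:i + size] for i in range(0, len(items) - size + 1, stride)]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def drift_and_momentum(window_means: List[Vector], beta: float) -> Tuple[List[Vector], List[Vector]]:
    """Assumed encoding: drift = change between consecutive window means,
    momentum = exponential moving average of drift."""
    dim = len(window_means[0])
    drifts: List[Vector] = []
    momenta: List[Vector] = []
    momentum = [0.0] * dim
    for k, current in enumerate(window_means):
        if k == 0:
            drift = [0.0] * dim  # no previous window to compare against
        else:
            drift = [a - b for a, b in zip(current, window_means[k - 1])]
        momentum = [beta * m + (1.0 - beta) * d for m, d in zip(momentum, drift)]
        drifts.append(drift)
        momenta.append(momentum)
    return drifts, momenta


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: List[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Scaled dot-product attention of one query over the window features."""
    scale = math.sqrt(len(query))
    scores = [sum(f * q for f, q in zip(feat, query)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def aggregate(embeddings: List[Vector], query: Vector,
              size: int = 3, stride: int = 1, beta: float = 0.5) -> Tuple[Vector, List[float]]:
    """Pool a time-ordered list of embeddings into one temporal representation."""
    if not embeddings:
        raise ValueError("embeddings must not be empty")
    if len(query) != 3 * len(embeddings[0]):
        raise ValueError("query must have 3x the embedding dimension")
    windows = overlapping_windows(embeddings, size, stride)
    means = [mean(w) for w in windows]
    drifts, momenta = drift_and_momentum(means, beta)
    # Each window is described by its mean, its drift, and its momentum.
    features = [m + d + mo for m, d, mo in zip(means, drifts, momenta)]
    return attention_pool(features, query)
```

In this sketch, a stride smaller than the window size produces the overlap, the drift term reacts to short-term changes between adjacent windows, and the momentum term smooths those changes into a longer-term trend. A trained model would presumably learn the query and operate on fused text-image embeddings rather than raw lists.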
