Short-form video platforms are major news channels, but they are also fertile ground for multimodal misinformation in which each modality looks plausible on its own while the relationships between modalities are subtly inconsistent, for example visuals that do not match their captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual consistency but only moderate text-audio consistency, whereas fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which both the predicted fake probability and the prediction error vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes consistency signals across the three modalities (text, visual, audio) at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, uses multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier that selectively routes samples to a VLM. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. The resulting two-stage system matches VLM-level accuracy while achieving 18-27x higher throughput and 93% VRAM savings, a strong cost-performance tradeoff.
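To make the pairwise and global consistency scores and the uncertainty-based routing concrete, the following is a minimal sketch and not the MAGIC3 implementation. It assumes each modality has already been reduced to a single embedding vector, uses plain cosine similarity as a stand-in for the learned consistency modeling, and routes on binary entropy with an arbitrary threshold. All function names, the averaging used for the global score, and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(text, visual, audio):
    """Pairwise consistency for the three modality pairs plus a global score.

    Assumption: the global score is the plain mean of the three pairwise
    scores; the paper's actual aggregation may differ.
    """
    scores = {
        "text_visual": cosine(text, visual),
        "text_audio": cosine(text, audio),
        "visual_audio": cosine(visual, audio),
    }
    scores["global"] = sum(scores.values()) / 3.0
    return scores


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p=0 or p=1, 1 at p=0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route(p_fake, entropy_threshold=0.9):
    """Selective routing: keep confident predictions, defer uncertain ones.

    Returns "vlm" when the fast classifier's entropy exceeds the
    (hypothetical) threshold, otherwise the fast classifier's own label.
    """
    if binary_entropy(p_fake) > entropy_threshold:
        return "vlm"
    return "fake" if p_fake >= 0.5 else "real"


if __name__ == "__main__":
    # Toy embeddings: text agrees with visual, audio is orthogonal to both.
    print(consistency_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    for p in (0.02, 0.5, 0.97):
        print(p, route(p))
```

In a two-stage setup of this kind, only the samples labeled "vlm" would incur the cost of the larger model, which is how selective routing can trade a small number of expensive calls for higher overall throughput.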