Misinformation on the web increasingly appears in multimodal forms, combining text, images, and OCR-rendered content in ways that amplify harm to public trust and vulnerable communities. While prior fact-checking systems often rely on unimodal signals or shallow fusion strategies, modern misinformation campaigns operate across modalities and demand models that can reason over subtle cross-modal inconsistencies in a transparent and responsible manner. We introduce MultiCheck, a lightweight and interpretable framework for multimodal fact verification that jointly analyzes textual, visual, and OCR evidence. At its core, MultiCheck employs a relational fusion module based on element-wise difference and product operations, enabling explicit cross-modal interaction modeling with minimal computational overhead. A contrastive alignment objective further helps the model distinguish supporting from refuting evidence while maintaining a small memory and energy footprint, making it suitable for low-resource deployment. Evaluated on the Factify-2 (5-class) and Mocheg (3-class) benchmarks, MultiCheck achieves substantial performance improvements and remains robust under noisy OCR and missing-modality conditions. Its efficiency, transparency, and real-world robustness make it well suited for journalists, civil society organizations, and web integrity efforts working to build a safer and more trustworthy web.
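The abstract describes the fusion mechanism only at a high level. The following PyTorch sketch illustrates one plausible reading of a relational fusion module built on element-wise difference and product operations, plus an InfoNCE-style contrastive alignment loss. The class and function names, the embedding dimension, the two-pair (text-image, text-OCR) design, and the symmetric InfoNCE formulation are all assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalFusion(nn.Module):
    """Hypothetical relational fusion head: given pooled text, image, and
    OCR embeddings (assumed to share dimension d), build pairwise
    interaction features via element-wise difference and product, then
    classify the claim-evidence relation."""

    def __init__(self, d: int = 512, num_classes: int = 5):
        super().__init__()
        # Each pair (text-image, text-OCR) yields [a, b, |a-b|, a*b] -> 4d,
        # so two pairs concatenated give an 8d input to the classifier.
        self.classifier = nn.Sequential(
            nn.Linear(8 * d, d),
            nn.ReLU(),
            nn.Linear(d, num_classes),
        )

    @staticmethod
    def pair_features(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Explicit cross-modal interactions: the difference term highlights
        # inconsistencies, the product term highlights agreement.
        return torch.cat([a, b, torch.abs(a - b), a * b], dim=-1)

    def forward(self, text: torch.Tensor, image: torch.Tensor,
                ocr: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([
            self.pair_features(text, image),
            self.pair_features(text, ocr),
        ], dim=-1)
        return self.classifier(feats)


def contrastive_alignment_loss(text: torch.Tensor, image: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment: pull matched text-image pairs together and
    push mismatched in-batch pairs apart (an assumed instantiation of the
    abstract's 'contrastive alignment objective')."""
    t = F.normalize(text, dim=-1)
    v = F.normalize(image, dim=-1)
    logits = t @ v.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric loss over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Because the fusion head operates on pooled embeddings with only element-wise operations and a small MLP, its overhead is negligible compared with the backbone encoders, which is consistent with the abstract's claim of a small memory and energy footprint.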