KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection

Fake news detection remains a challenging problem because of the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multimodal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving relevant entities from a knowledge graph and adaptively selecting among them. The fused multimodal features are processed by a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showing the effectiveness of the neighbor selection mechanism and of multimodal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and natural language inference (NLI)-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at \href{https://github.com/latuanvinh1998/KGAlign}{github.com/latuanvinh1998/KGAlign}.
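
As a rough illustration of the selection-then-fusion idea described above, the following minimal sketch filters candidate knowledge-graph neighbors by a relevance score and concatenates the surviving entities with text and image tokens. It is not the released KGAlign implementation: the function names (`select_neighbors`, `fuse`), the threshold and cap values, the toy embeddings, and the use of cosine similarity as a stand-in for the NLI-guided relevance score are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_neighbors(query_vec, neighbors, threshold=0.5, max_keep=2):
    """Keep at most `max_keep` neighbors whose relevance score is >= `threshold`.

    Assumption: cosine similarity stands in for the relevance score here;
    the abstract describes NLI-guided filtering, which is not reproduced.
    """
    scored = [(cosine(query_vec, vec), name) for name, vec in neighbors.items()]
    kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)
    return [name for _, name in kept[:max_keep]]


def fuse(*token_groups):
    """Concatenate token groups into one sequence for a downstream classifier."""
    return [tok for group in token_groups for tok in group]


# Hypothetical 3-dimensional embeddings, for illustration only.
news_vec = [1.0, 0.0, 0.0]
kg_neighbors = {
    "entity_a": [0.9, 0.1, 0.0],
    "entity_b": [0.0, 1.0, 0.0],
    "entity_c": [0.7, 0.7, 0.0],
    "entity_d": [0.8, 0.0, 0.6],
}

selected = select_neighbors(news_vec, kg_neighbors)

# Placeholder tokens standing in for the text, object-level, and global image features.
fused = fuse(["txt_0", "txt_1"], ["obj_0", "obj_1", "obj_2"], ["img_global"], selected)
print(selected)
print(len(fused))
```

In this toy setting the cap and threshold together decide how many entities reach the classifier; an adaptive scheme such as the one the abstract describes would presumably set these per sample rather than as fixed constants.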
