VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos

Short video platforms have become important channels for news dissemination, giving users an engaging and immediate way to follow current events and share information. They have also become significant conduits for misinformation, because fake news and rumors can exploit the visual appeal and wide reach of short videos to spread quickly. Existing fake news detection methods rely mainly on a single modality, such as text or images, or apply only basic fusion techniques, which limits their ability to handle the complex, multi-layered information in short videos. To address these limitations, this paper presents VMID, a novel multimodal fake news detection method that identifies misinformation through a multi-level analysis of video content. The approach converts the representations of the different modalities into a unified textual description, which is then fed into a large language model for comprehensive evaluation. By integrating the multimodal features of a video in this way, the framework improves the accuracy and reliability of fake news detection. Experimental results show that the approach outperforms existing models in accuracy, robustness, and utilization of multimodal information: it achieves an accuracy of 90.93%, which is 9.88 percentage points above the best baseline model (SV-FEND) at 81.05%. Case studies further show that the approach accurately distinguishes between fake news, debunking content, and real incidents, supporting its reliability and robustness in real-world applications.
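
The central idea in the abstract, turning each modality into text and handing one unified description to a large language model, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `VideoModalities` fields, the prompt wording, the `classify` helper, and the `llm` callable (any function that maps a prompt string to a reply string) are hypothetical and are not taken from the paper. The three labels mirror the categories mentioned for the case studies.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Categories mentioned in the abstract's case studies; the short names are assumptions.
LABELS = ("fake", "debunking", "real")


@dataclass
class VideoModalities:
    """Hypothetical container for per-modality text; field names are assumptions."""
    title: str = ""
    speech_transcript: str = ""   # e.g. output of a speech recognizer
    on_screen_text: str = ""      # e.g. output of an OCR step
    frame_captions: List[str] = field(default_factory=list)  # e.g. image captions


def build_unified_description(video: VideoModalities) -> str:
    """Merge the non-empty modalities into one labeled text block."""
    sections = [
        ("Title", video.title),
        ("Speech transcript", video.speech_transcript),
        ("On-screen text", video.on_screen_text),
        ("Visual content", "; ".join(video.frame_captions)),
    ]
    return "\n".join(f"{name}: {text}" for name, text in sections if text.strip())


def build_prompt(description: str) -> str:
    """Wrap the unified description in an instruction for the language model."""
    return (
        "You are checking a short news video for misinformation.\n"
        f"{description}\n"
        f"Answer with exactly one word from: {', '.join(LABELS)}."
    )


def classify(video: VideoModalities, llm: Callable[[str], str]) -> str:
    """Send the prompt to a caller-supplied model and normalize its reply."""
    reply = llm(build_prompt(build_unified_description(video)))
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    # Anything outside the expected label set is reported as "unknown".
    return first if first in LABELS else "unknown"
```

A caller would pass a real model client as `llm`; with a stand-in such as `lambda prompt: "Debunking."`, `classify` returns `"debunking"`. Keeping the model behind a plain callable makes the text-fusion step testable without committing to any particular model or API.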
