This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. Specifically, we investigate whether incorporating multimodal content improves performance over text-only models, and how well VLMs exploit text and image information to strengthen misinformation detection. We further propose a probing-classifier-based solution built on VLMs: our approach extracts embeddings from the last hidden layer of selected VLMs and feeds them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we demonstrate that while multimodality can enhance performance, fusing separate embeddings from dedicated text and image encoders yields superior results compared to using VLM embeddings directly. Moreover, the proposed neural classifier significantly outperforms KNN and SVM baselines in leveraging the extracted embeddings, highlighting its effectiveness for multimodal fact-checking.
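To make the probing setup concrete, the sketch below shows one plausible realization in PyTorch: a frozen Hugging Face-style model exposes its hidden states via `output_hidden_states=True`, the last layer is pooled into a single vector, and a small MLP probe maps that vector to veracity classes. The function and class names (`extract_embedding`, `VeracityProbe`), the mean-pooling step, and the hidden width, dropout rate, and class count are all illustrative assumptions; the abstract does not specify the exact pooling strategy or probe architecture.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_embedding(model, inputs):
    """Mean-pool the last hidden layer of a Hugging Face-style model.

    `inputs` is the dict produced by the model's processor (text tokens
    plus pixel values for a VLM). Mean pooling is an assumption here;
    other pooling choices (e.g., last token) are equally plausible.
    """
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]   # (batch, seq_len, dim)
    return last_hidden.mean(dim=1)            # (batch, dim)

class VeracityProbe(nn.Module):
    """Small MLP probe trained on frozen VLM embeddings for multi-class
    veracity classification. Hidden width and dropout are illustrative
    choices, not settings reported in the paper."""

    def __init__(self, embed_dim: int, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: a 4096-dim embedding (typical of 7B-scale VLMs, assumed here)
# and three veracity classes.
probe = VeracityProbe(embed_dim=4096, num_classes=3)
logits = probe(torch.randn(8, 4096))          # (8, 3)
```

Because only the probe is trained while the VLM stays frozen, this design isolates how much veracity-relevant information the VLM's representations already carry, which is the core question the study poses.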