The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through multi-level feature extraction across the spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy in detecting sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when detecting user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model's superior ability to generalize to unseen, fine-tuned variations of Stable Diffusion models. In terms of robustness, ViGText achieves an 11.1% increase in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
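To make the patch-graph idea concrete, the sketch below shows one plausible way to turn an image into a graph of patches with simple spatial and frequency features, followed by a single mean-aggregation message-passing step. This is an illustrative simplification, not ViGText's actual implementation: the function names (`image_to_patch_graph`, `gnn_layer`), the 4-neighbour grid connectivity, and the two toy features (mean intensity and dominant FFT magnitude) are assumptions standing in for the paper's multi-level spatial/frequency features and full GNN.

```python
import numpy as np

def image_to_patch_graph(image, patch_size):
    """Split a square grayscale image into non-overlapping patches and
    connect each patch to its 4-neighbours on the patch grid.
    Returns (node_features, edge_list)."""
    h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    feats = []
    for i in range(gh):
        for j in range(gw):
            patch = image[i * patch_size:(i + 1) * patch_size,
                          j * patch_size:(j + 1) * patch_size]
            # Toy spatial feature: mean intensity. Toy frequency feature:
            # magnitude of the strongest non-DC FFT coefficient.
            spec = np.abs(np.fft.fft2(patch))
            spec[0, 0] = 0.0  # drop the DC component
            feats.append([patch.mean(), spec.max()])
    edges = []
    for i in range(gh):
        for j in range(gw):
            u = i * gw + j
            if j + 1 < gw:          # right neighbour
                edges.append((u, u + 1))
            if i + 1 < gh:          # bottom neighbour
                edges.append((u, u + gw))
    return np.array(feats), edges

def gnn_layer(feats, edges):
    """One round of mean-aggregation message passing: each node averages
    its own feature vector with those of its neighbours (a minimal GNN step)."""
    agg = feats.copy()
    deg = np.ones(len(feats))  # count the node itself
    for u, v in edges:
        agg[u] += feats[v]
        agg[v] += feats[u]
        deg[u] += 1
        deg[v] += 1
    return agg / deg[:, None]

rng = np.random.default_rng(0)
image = rng.random((32, 32))
feats, edges = image_to_patch_graph(image, patch_size=8)
smoothed = gnn_layer(feats, edges)
print(feats.shape, len(edges))  # 4x4 patch grid: 16 nodes, 24 grid edges
```

In the full system, a text graph built from the VLLM explanation would be fused with this image graph before classification; stacking several such message-passing layers lets inconsistency cues propagate across neighbouring patches.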