Multimodal Graph Learning for Deepfake Detection

Existing deepfake detectors struggle to achieve robustness and generalization. A primary reason is their limited ability to extract relevant information from forged videos, especially when several kinds of artifacts are present, such as spatial, frequency, temporal, and landmark mismatches. Current detectors rely either on pixel-level features, which are easily affected by unknown disturbances, or on facial landmarks, which do not provide sufficient information on their own. Furthermore, most detectors cannot combine information from multiple domains, which limits their effectiveness in identifying deepfake videos. To address these limitations, we propose a novel framework, Multimodal Graph Learning (MGL), that leverages information from multiple modalities using two GNNs and several multimodal fusion modules. At the frame level, we employ a bi-directional cross-modal transformer and an adaptive gating mechanism to combine features from the spatial and frequency domains with geometric-enhanced landmark features captured by a GNN. At the video level, we use a Graph Attention Network (GAT) that represents each frame of a video as a node in a graph and encodes temporal information into the edges of the graph, in order to extract temporal inconsistencies between frames. In this way, the proposed method aims to identify and exploit distinguishing features for deepfake detection. We evaluate our method through extensive experiments on widely used benchmarks and demonstrate that it outperforms state-of-the-art detectors in terms of generalization ability and robustness against unknown disturbances.

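The two stages described in the abstract can be illustrated with a minimal sketch. It is not the MGL implementation. Every function name, the per-dimension scalar gate, the window-based edge construction, the use of the frame offset as the only edge feature, the mean pooling, and the random weights are assumptions made for illustration. The bi-directional cross-modal transformer and the landmark GNN are omitted, and the attention step skips the learned projection that a full GAT layer would apply.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(appearance, landmark, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    Assumption: the gate for dimension i sees only the two scalars at i.
    """
    fused = []
    for i, (a, l) in enumerate(zip(appearance, landmark)):
        g = sigmoid(gate_weights[i][0] * a + gate_weights[i][1] * l + gate_bias[i])
        fused.append(g * a + (1.0 - g) * l)  # convex combination of a and l
    return fused


def temporal_edges(num_frames, window):
    """Connect each frame to itself and to frames within `window` steps."""
    return {
        i: [j for j in range(num_frames) if abs(i - j) <= window]
        for i in range(num_frames)
    }


def graph_attention(nodes, edges, a_src, a_dst, a_time):
    """One single-head GAT-style update over frame nodes.

    The score for edge (i, j) adds a term for the frame offset j - i,
    which is how temporal information enters through the edges here.
    """
    updated = []
    for i, h_i in enumerate(nodes):
        neighbors = edges[i]
        scores = [
            leaky_relu(dot(a_src, h_i) + dot(a_dst, nodes[j]) + a_time * (j - i))
            for j in neighbors
        ]
        peak = max(scores)  # subtract the maximum for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        alphas = [w / total for w in weights]
        updated.append([
            sum(alpha * nodes[j][k] for alpha, j in zip(alphas, neighbors))
            for k in range(len(h_i))
        ])
    return updated


def video_embedding(nodes):
    """Mean-pool node features into one video-level vector."""
    return [sum(h[k] for h in nodes) / len(nodes) for k in range(len(nodes[0]))]


# Toy data: random stand-ins for learned features and weights.
rng = random.Random(0)
dim, num_frames = 4, 6


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


appearance = [rand_vec(dim) for _ in range(num_frames)]  # spatial/frequency
landmark = [rand_vec(dim) for _ in range(num_frames)]    # landmark features
gate_w = [rand_vec(2) for _ in range(dim)]
gate_b = [0.0] * dim

frames = [gated_fusion(a, l, gate_w, gate_b) for a, l in zip(appearance, landmark)]
edges = temporal_edges(num_frames, window=2)
updated = graph_attention(frames, edges, rand_vec(dim), rand_vec(dim), a_time=0.1)
embedding = video_embedding(updated)
print(len(embedding))
```

In this sketch the gate decides, per feature dimension, how much to trust the appearance features over the landmark features, and the attention weights decide how much each neighboring frame contributes to a frame's updated representation. A classifier head on `embedding` would produce the real/fake prediction and is left out here.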