CapST: An Enhanced and Lightweight Model Attribution Approach for Synthetic Videos

Deepfake videos, generated through AI face-swapping techniques, have garnered considerable attention due to their potential for powerful impersonation attacks. While existing research primarily focuses on binary classification to distinguish real from fake videos, determining the specific generation model behind a fake video is crucial for forensic investigation. Addressing this gap, this paper investigates the model attribution problem for Deepfake videos using a recently proposed dataset, Deepfakes from Different Models (DFDM), whose videos are generated by various Autoencoder models. The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio. This study formulates Deepfake model attribution as a multiclass classification task. It proposes a segment of VGG19, known for its effectiveness in image-related tasks, as the feature extraction backbone, integrated with a Capsule Network with a Spatio-Temporal attention mechanism. The Capsule module captures intricate hierarchies among features for robust identification of deepfake attributes. Additionally, the video-level fusion technique uses temporal attention to combine the concatenated frame-level feature vectors, capitalizing on the temporal dependencies inherent in deepfake videos. By aggregating information across frames, the model gains a comprehensive understanding of video content, resulting in more precise predictions. Experimental results on the DFDM benchmark dataset demonstrate the efficacy of the proposed method, which achieves up to a 4% improvement in accurately categorizing deepfake videos compared to baseline models while demanding fewer computational resources.
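
The video-level fusion step described above can be illustrated with a minimal sketch of temporal attention pooling. The abstract does not specify the exact form of the attention, so everything here is an assumption for illustration: the function name `temporal_attention_fuse`, the linear scoring vector `score_weights`, and the use of a plain softmax over frames are hypothetical and not taken from the paper. The sketch scores each frame's feature vector, normalizes the scores across frames, and returns the weighted sum as a single video-level vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fuse(frame_features, score_weights):
    """Fuse per-frame feature vectors into one video-level vector.

    frame_features: list of equal-length feature vectors, one per frame.
    score_weights:  hypothetical scoring vector (assumed, not from the
                    paper); each frame's score is its dot product with it.
    Returns (fused_vector, attention_weights).
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # One scalar relevance score per frame.
    scores = [sum(w * x for w, x in zip(score_weights, f))
              for f in frame_features]
    # Normalize scores across the time axis so they sum to 1.
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Attention-weighted sum of the frame features.
    fused = [sum(a * f[d] for a, f in zip(weights, frame_features))
             for d in range(dim)]
    return fused, weights
```

In a trained model the scoring vector would be learned jointly with the classifier; here it is passed in by hand so that the pooling arithmetic itself is easy to inspect. With a zero scoring vector the fusion reduces to a plain average over frames.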
