CapST: An Enhanced and Lightweight Method for Deepfake Video Classification

The proliferation of deepfake videos, synthetic media produced through advanced Artificial Intelligence techniques, has raised significant concerns in sectors such as politics, entertainment, and security. In response, this research introduces a lightweight model designed to classify deepfake videos generated by five distinct encoders. Our approach achieves state-of-the-art performance while reducing computational cost. The model uses part of a VGG19bn as a backbone for feature extraction, a strategy proven effective in image-related tasks. We integrate a Capsule Network with a spatial-temporal attention mechanism to strengthen the model's classification capability while conserving resources. This combination captures hierarchical relationships among features, enabling robust identification of deepfake attributes. For video-level prediction, we adopt an existing video-level fusion technique based on temporal attention. The temporal attention mechanism operates on the concatenated feature vectors and exploits the temporal dependencies within deepfake videos. By aggregating information across frames, the model forms a video-level representation of the content, which yields more precise predictions. Experimental results on DFDM, an extensive benchmark dataset of deepfake videos, demonstrate the efficacy of the proposed method. Notably, our approach achieves up to a 4 percent improvement in classification accuracy over baseline models while requiring fewer computational resources.
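
To illustrate the idea of temporal-attention fusion mentioned above, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes each frame has already been reduced to a fixed-length feature vector, and it scores frames with a single weight vector; the names `temporal_attention_fuse` and `score_weights` are hypothetical, and in a trained model the scoring parameters would be learned rather than supplied by hand.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fuse(
    frame_features: Sequence[Sequence[float]],
    score_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse per-frame feature vectors into one video-level vector.

    Assumption: each frame is scored by a dot product with
    `score_weights` (a stand-in for learned attention parameters).
    Returns the fused vector and the attention weight of each frame.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(score_weights)
    if any(len(f) != dim for f in frame_features):
        raise ValueError("every frame must have len(score_weights) features")

    # One scalar relevance score per frame.
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in frame_features]
    # Normalise the scores over the time axis so they sum to 1.
    attention = softmax(scores)
    # Attention-weighted sum of the frame features.
    fused = [
        sum(a * f[i] for a, f in zip(attention, frame_features))
        for i in range(dim)
    ]
    return fused, attention
```

With an all-zero `score_weights`, every frame receives the same weight and the fusion reduces to a plain average over frames; non-zero weights let frames with stronger evidence contribute more to the video-level vector that a classifier would consume.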
