The growing threat that deepfakes pose to society and cybersecurity has raised enormous public concern, and increasing effort has been devoted to deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout that preserves spatial and temporal dependencies. Specifically, consecutive frames are each masked at a fixed position to improve generalization, then resized to sub-images and rearranged into a pre-defined layout to form the thumbnail. TALL is model-agnostic and extremely simple, requiring changes to only a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method, TALL-Swin. Extensive intra-dataset and cross-dataset experiments verify the validity and superiority of TALL and the state-of-the-art (SOTA) TALL-Swin. TALL-Swin achieves 90.79$\%$ AUC on the challenging cross-dataset task FaceForensics++ $\to$ Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
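The mask, resize, and rearrange steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); the function names `resize_nearest` and `make_thumbnail`, the 2×2 grid, the 112-pixel sub-image size, the 16-pixel zero mask, the row-major ordering, and the use of nearest-neighbour resizing are all assumptions chosen for illustration.

```python
import numpy as np


def resize_nearest(frame, size):
    """Nearest-neighbour resize of an (H, W, C) frame to (size, size, C)."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return frame[rows[:, None], cols]


def make_thumbnail(frames, grid=(2, 2), sub_size=112, mask_size=16, rng=None):
    """Mask, resize, and tile consecutive frames into one thumbnail image.

    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    n_rows, n_cols = grid
    if len(frames) != n_rows * n_cols:
        raise ValueError("number of frames must equal rows * cols")
    rng = np.random.default_rng() if rng is None else rng

    # Assumption: one mask position is drawn per clip and reused for every frame.
    h, w = frames[0].shape[:2]
    top = int(rng.integers(0, h - mask_size + 1))
    left = int(rng.integers(0, w - mask_size + 1))

    subs = []
    for frame in frames:
        masked = frame.copy()
        masked[top:top + mask_size, left:left + mask_size] = 0  # zero out the square
        subs.append(resize_nearest(masked, sub_size))

    # Row-major layout: frame i lands in row i // n_cols, column i % n_cols.
    rows = [
        np.concatenate(subs[r * n_cols:(r + 1) * n_cols], axis=1)
        for r in range(n_rows)
    ]
    return np.concatenate(rows, axis=0)
```

Under these assumptions, four 224×224 frames become a single 224×224 thumbnail, so an unmodified image backbone can consume a clip as if it were one image, which is consistent with the abstract's claim that TALL is model-agnostic.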