Undercover Deepfakes: Detecting Fake Segments in Videos

The recent renaissance in generative models, driven primarily by the advent of diffusion models and iterative improvements in GAN methods, has enabled many creative applications. However, each advancement is accompanied by a rise in the potential for misuse. In the arena of deepfake generation, this is a key societal issue. In particular, the ability to modify segments of videos using such generative techniques creates a new paradigm of deepfakes: mostly real videos altered slightly to distort the truth. This paradigm has been under-explored by current deepfake detection methods in the academic literature. In this paper, we present a deepfake detection method that addresses this issue by performing deepfake prediction at the frame and video levels. To facilitate testing our method, we prepared a new benchmark dataset in which videos contain both real and fake frame sequences with very subtle transitions. We provide a benchmark on the proposed dataset with our detection method, which uses a Vision Transformer based on Scaling and Shifting to learn spatial features and a Timeseries Transformer to learn temporal features of the videos, helping to interpret possible deepfakes. Extensive experiments on a variety of deepfake generation methods show excellent results for the proposed method on temporal segmentation as well as classical video-level prediction. In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes. All experiments can be reproduced at: https://t.ly/_bOh9.
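To make the relationship between frame-level prediction, temporal segmentation, and video-level prediction concrete, the following is a minimal sketch, not the method described above. It assumes a detector has already produced one fake probability per frame. The function names `fake_segments` and `video_is_fake`, the 0.5 threshold, and the "at least one fake frame" aggregation rule are all illustrative assumptions.

```python
from typing import List, Tuple


def fake_segments(frame_probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames flagged as fake into segments.

    A frame is flagged when its probability is >= threshold (an assumed rule).
    Returns inclusive (start, end) frame-index pairs.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current fake run began, if any
    for i, p in enumerate(frame_probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The fake run ended on the previous frame.
            segments.append((start, i - 1))
            start = None
    if start is not None:
        # The video ends while still inside a fake run.
        segments.append((start, len(frame_probs) - 1))
    return segments


def video_is_fake(frame_probs: List[float], threshold: float = 0.5,
                  min_fake_frames: int = 1) -> bool:
    """Video-level label: fake if enough frames are flagged (assumed rule)."""
    fake_count = sum(1 for p in frame_probs if p >= threshold)
    return fake_count >= min_fake_frames


if __name__ == "__main__":
    # Hypothetical per-frame probabilities for a mostly real video.
    probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7, 0.95]
    print(fake_segments(probs))
    print(video_is_fake(probs))
```

The segment list is what a human moderator would review, while the video-level label corresponds to the classical whole-video decision. In practice the threshold and the minimum number of flagged frames would need tuning on validation data.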
