This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional convolutional neural networks (ConvNets) for detecting facial deepfake images and videos. It examines their potential for improved generalization and explainability, especially with limited training data. Despite the success of transformer architectures across a wide range of tasks, the deepfake detection community has been hesitant to adopt large ViTs as feature extractors because of their perceived need for extensive training data and their suboptimal generalization when trained on small datasets. This contrasts with ConvNets, which are already established as robust feature extractors. Additionally, training ViTs from scratch demands substantial resources, effectively restricting their use to large, well-resourced organizations. Recent advances in self-supervised learning (SSL) for ViTs, such as masked autoencoders and DINO, demonstrate adaptability across diverse tasks as well as emergent semantic segmentation capabilities. By leveraging SSL ViTs for deepfake detection with modest training data and partial fine-tuning, we find adaptability to deepfake detection comparable to that of supervised counterparts, along with explainability provided by the attention mechanism. Moreover, partial fine-tuning of ViTs is a resource-efficient option.
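To make the approach concrete, the following PyTorch sketch illustrates partial fine-tuning of a self-supervised DINO ViT-B/16 backbone for binary real/fake classification, together with reading out the last-layer self-attention as a rough visual explanation. This is a minimal sketch, not the paper's exact training setup: the number of unfrozen blocks, the `DeepfakeDetector` wrapper, the classification head, the optimizer settings, and the dummy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Load the self-supervised DINO ViT-B/16 backbone from the official torch.hub entry.
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')

# Partial fine-tuning: freeze the whole backbone, then unfreeze only the last
# few transformer blocks (unfreezing 2 blocks is an assumption for illustration).
for p in backbone.parameters():
    p.requires_grad = False
for blk in backbone.blocks[-2:]:
    for p in blk.parameters():
        p.requires_grad = True


class DeepfakeDetector(nn.Module):
    """SSL ViT backbone plus a small binary real/fake head (hypothetical wrapper)."""

    def __init__(self, backbone, num_classes=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.embed_dim, num_classes)

    def forward(self, x):
        cls_embedding = self.backbone(x)  # DINO's ViT forward returns the [CLS] token embedding
        return self.head(cls_embedding)


model = DeepfakeDetector(backbone)

# Optimize only the parameters left trainable (last blocks + classification head).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 face crops
# (real data would be normalized face crops from a deepfake dataset).
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

# Explainability: self-attention of the last block, as exposed by the DINO code.
with torch.no_grad():
    attn = backbone.get_last_selfattention(images)  # shape: (batch, heads, tokens, tokens)
    cls_to_patches = attn[:, :, 0, 1:]              # [CLS] attention over image patches, per head
```

Freezing most of the backbone keeps memory and compute requirements modest, which is the resource-efficiency argument made in the abstract, while the per-head [CLS]-to-patch attention maps can be reshaped into the patch grid and overlaid on the input face for qualitative inspection.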