This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and convolutional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models built on transformer architectures across many tasks, including zero-shot and few-shot learning, the deepfake detection community has remained reluctant to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data and can yield suboptimal generalization when the training or fine-tuning data is small or lacks diversity. This contrasts with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making this feasible primarily for large companies and hindering broader investigation within the academic community. Recent advances in self-supervised learning (SSL) for transformers, such as DINO and its derivatives, have demonstrated strong adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and partial fine-tuning, we observe that it adapts well to the task, while the attention mechanism provides natural explainability of the detection results. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly less computation than full fine-tuning.
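To make the partial fine-tuning regime concrete, the sketch below freezes a self-supervised DINO ViT backbone except for its final transformer block, attaches a small binary real/fake head, and reads out the last block's self-attention for inspection. This is a minimal illustration assuming PyTorch and the public facebookresearch/dino torch.hub entry point; the choice of which layers to unfreeze, the head, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of partial fine-tuning a self-supervised DINO ViT for
# binary deepfake detection. Which layers to unfreeze and all
# hyperparameters below are illustrative choices, not the paper's setup.
import torch
import torch.nn as nn

# Load a DINO ViT-S/16 backbone pre-trained without labels.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")

# Freeze the entire backbone, then re-enable gradients only for the
# final transformer block and the last LayerNorm (partial fine-tuning).
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.blocks[-1].parameters():
    p.requires_grad = True
for p in backbone.norm.parameters():
    p.requires_grad = True


class DeepfakeDetector(nn.Module):
    """DINO backbone plus a linear real/fake head on the CLS embedding."""

    def __init__(self, backbone: nn.Module, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The DINO ViT's forward returns the normalized CLS token.
        return self.head(self.backbone(x))


model = DeepfakeDetector(backbone)

# Optimize only the parameters left trainable: one block, the final
# norm, and the head -- a small fraction of the full model.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One training step on a dummy batch (224x224 crops; 0=real, 1=fake).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()

# For explainability, DINO exposes the last block's self-attention,
# which can be overlaid on the input to inspect what drives a decision.
with torch.no_grad():
    attn = backbone.get_last_selfattention(images)  # (B, heads, N, N)
cls_attn = attn[:, :, 0, 1:]  # CLS-to-patch attention, one map per head
```

Because gradients flow through only the last block and the head, both the memory footprint and the step time stay far below those of full fine-tuning, which is what makes this regime practical with modest data and hardware.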