State-of-the-art approaches to deepfake detection treat it as a binary classification problem and rely on image-based features extracted by neural networks. While these approaches, trained in a supervised manner, extract features that are likely indicative of fakes, they may fall short in representing unnatural "non-physical" semantic facial attributes such as blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. Yet humans generally perceive such attributes easily through common sense reasoning. Furthermore, image-based feature extraction methods that provide visual explanations via saliency maps can be hard for humans to interpret. To address these challenges, we propose using common sense reasoning to model deepfake detection, and we extend it to the Deepfake Detection VQA (DD-VQA) task, which aims to model human intuition in explaining why an image is labeled as either real or fake. To this end, we introduce a new dataset that provides answers to questions about the authenticity of an image, along with the corresponding explanations. We also propose a Vision and Language Transformer-based framework for the DD-VQA task that incorporates text- and image-aware feature alignment formulations. Finally, we evaluate our method on both deepfake detection performance and the quality of the generated explanations. We hope that this task inspires researchers to explore new avenues for enhancing language-based interpretability and cross-modality applications in deepfake detection.