Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat deepfake detection as binary classification, training the detection model with digital labels or mask signals. We argue that such supervision lacks semantic information and interpretability. To address this issue, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as annotations. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via a Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are used jointly to train the model within the Coarse-and-Fine Co-training framework (C2F), which improves the model's generalization and interpretability. Experiments show that the proposed method improves existing detection models on several challenging benchmarks. Furthermore, we have integrated our method with multimodal large models and achieved noteworthy results. This integration not only enhances the performance of the VLFFD paradigm but also underscores the versatility and adaptability of our method when combined with advanced multimodal technologies, highlighting its potential for tackling the evolving challenges of deepfake detection.