The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, while other modalities such as text and face parsing remain underexplored. Moreover, they fail to assess, in a fine-grained manner, how well deepfake attributors generalize to unseen advanced generators such as diffusion models. In this paper, we propose a novel parsing-aware vision-language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA), which enables effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel, fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors on unseen advanced generators such as diffusion models. We further propose an innovative PVLM attributor built on a vision-language model to capture general and diverse attribution features. Our design is motivated by the observation that facial images generated by GAN and diffusion models differ significantly in how faithfully they preserve the source face's attributes, and we exploit these inherent differences in attribute preservation to capture face-parsing-aware forgery representations. To this end, we devise a novel parsing encoder that focuses on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss that pulls representations of the same generator closer together and pushes those of different generators apart; it can be introduced into existing DFA models to enhance traceability. Experimental results show that our model outperforms the state of the art on the ZS-DFA benchmark under various evaluation protocols.
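To make the vision-parsing matching idea concrete, below is a minimal sketch of a symmetric contrastive matching objective between vision and parsing embeddings. The abstract does not specify the "dynamic" weighting scheme, so this sketch assumes a standard CLIP-style InfoNCE formulation; the function name, temperature value, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def vision_parsing_matching_loss(vis: torch.Tensor,
                                 par: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric contrastive matching between vision and
    parsing embeddings of the same batch of face images.

    vis, par: (B, D) embeddings from the vision and parsing encoders;
    embeddings at the same row index are treated as positive pairs.
    """
    vis = F.normalize(vis, dim=1)
    par = F.normalize(par, dim=1)
    logits = vis @ par.t() / temperature                  # (B, B) similarities
    targets = torch.arange(vis.size(0), device=vis.device)
    # Symmetric cross-entropy: vision-to-parsing and parsing-to-vision.
    loss_v2p = F.cross_entropy(logits, targets)
    loss_p2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2p + loss_p2v)
```

Aligning the two modalities this way encourages the vision branch to encode the attribute-preservation cues that the parsing branch makes explicit.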
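Similarly, the deepfake attribution contrastive center loss can be sketched as follows, since the abstract notes it can be plugged into existing DFA models. This is only an illustrative implementation under assumed design choices (learnable per-generator centers, normalized features, a hinge margin on negative centers); the class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveCenterLoss(nn.Module):
    """Sketch of a contrastive center loss for generator attribution.

    Keeps one learnable center per generator; pulls each feature toward
    its own generator's center and pushes it away (up to a margin) from
    the centers of all other generators.
    """

    def __init__(self, num_generators: int, feat_dim: int, margin: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_generators, feat_dim))
        self.margin = margin

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (B, D) attribution features; labels: (B,) generator indices.
        feats = F.normalize(feats, dim=1)
        centers = F.normalize(self.centers, dim=1)
        dists = torch.cdist(feats, centers) ** 2          # (B, C) squared distances
        pull = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
        mask = F.one_hot(labels, num_classes=centers.size(0)).bool()
        # Hinge: penalize other generators' centers that lie within the margin;
        # the own-class column is neutralized so it contributes zero.
        push = F.relu(self.margin - dists.masked_fill(mask, self.margin))
        push = push.sum(dim=1) / max(centers.size(0) - 1, 1)
        return (pull + push).mean()


# Illustrative usage with random features and labels.
loss_fn = ContrastiveCenterLoss(num_generators=8, feat_dim=512)
feats = torch.randn(32, 512, requires_grad=True)
labels = torch.randint(0, 8, (32,))
loss = loss_fn(feats, labels)
loss.backward()
```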