Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have enabled the malicious creation of deep forgeries, raising concerns about the trustworthiness of multimedia content. Adapting traditional forgery detection methods to diffusion models proves challenging. This paper therefore proposes Trinity Detector, a forgery detection method designed explicitly for diffusion models. Trinity Detector incorporates coarse-grained text features through a CLIP encoder and coherently integrates them with fine-grained pixel-domain artifacts for comprehensive multimodal detection. To heighten sensitivity to the characteristics of diffusion-generated images, we design a Multi-spectral Channel Attention Fusion Unit (MCAF), which extracts spectral inconsistencies through adaptive fusion of diverse frequency bands and further integrates the spatial co-occurrence of the two modalities. Extensive experiments validate that Trinity Detector outperforms several state-of-the-art methods: its performance is competitive across all datasets, with up to a 17.6\% improvement in transferability on the diffusion datasets.