High-fidelity text-to-image models are currently developing at an accelerating pace. Among them, diffusion models have led to remarkable improvements in image generation quality, making it very challenging to distinguish real images from synthesized ones and simultaneously raising serious privacy and security concerns. Several methods detect diffusion-generated images through reconstruction; however, the inversion and denoising processes are time-consuming and rely heavily on the pre-trained generative model, so detection performance declines when that model encounters out-of-domain data. To address this issue, we propose Time Step Generating (TSG), a universal synthetic image detector that does not rely on a pre-trained model's reconstruction ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. These features are then passed to a classifier (e.g., ResNet), which efficiently detects whether an image is synthetic or real. We evaluate TSG on the large-scale GenImage benchmark, where it achieves significant improvements in both accuracy and generalizability.
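The pipeline described above can be sketched in code. This is a minimal, hedged sketch under stated assumptions: the real method would use an actual pre-trained diffusion noise-prediction network and a full ResNet classifier; here `StandInEpsNet` and the small CNN classifier are hypothetical stand-ins introduced only for illustration. The key idea shown is feeding an image (without added noise) into the network at a controlled time step t and treating the predicted noise as a detail-sensitive feature map.

```python
# Sketch of the TSG-style pipeline (assumptions: StandInEpsNet replaces a
# real pre-trained diffusion noise predictor eps_theta(x, t); the small CNN
# replaces a full ResNet classifier).
import torch
import torch.nn as nn

class StandInEpsNet(nn.Module):
    """Hypothetical stand-in for a pre-trained noise-prediction network."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        # A real UNet conditions on t via time embeddings; here we only
        # scale the output by t to keep the sketch self-contained.
        return self.conv(x) * (t / 1000.0)

def extract_tsg_features(eps_net, images, t=50):
    """Feed images directly (no noise added) at a fixed time step t; the
    predicted noise serves as the distinguishing feature map."""
    with torch.no_grad():
        return eps_net(images, t)

# Small stand-in classifier producing 2-way real/synthetic logits.
classifier = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))

eps_net = StandInEpsNet().eval()
imgs = torch.rand(4, 3, 32, 32)                 # a batch of RGB images
feats = extract_tsg_features(eps_net, imgs, t=50)
logits = classifier(feats)                      # one logit pair per image
```

In this sketch the feature map keeps the spatial shape of the input, so any image classifier can be trained on it; the choice of a small t targets fine-grained, high-frequency detail rather than global structure.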