Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in the latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational cost. However, while the diffusion process has moved to the latent space, contrastive language-image pre-training (CLIP) models, which are used in many image processing tasks, still operate in pixel space. Consequently, latent images must undergo costly VAE decoding before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches the zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and an LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the overall pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.
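To make the pixel-space vs. latent-space distinction concrete, the following is a minimal PyTorch sketch of the two reward paths. All names here (PixelCLIPScorer, LatentCLIPScorer, the stub encoder) and the 4x64x64 SD-style latent shape are illustrative assumptions, not the paper's actual implementation; the point is only that the latent path drops the VAE decode.

```python
# Minimal sketch: scoring a latent image against a text embedding,
# with and without the VAE decode step. Hypothetical module names.
import torch
import torch.nn as nn


class PixelCLIPScorer(nn.Module):
    """Baseline path: decode the latent with the VAE, then score in pixel space."""

    def __init__(self, vae_decoder: nn.Module, clip_image_encoder: nn.Module):
        super().__init__()
        self.vae_decoder = vae_decoder
        self.clip_image_encoder = clip_image_encoder

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        images = self.vae_decoder(latents)            # the costly decode step
        img_emb = self.clip_image_encoder(images)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        return (img_emb * text_emb).sum(dim=-1)       # cosine similarity reward


class LatentCLIPScorer(nn.Module):
    """Latent-CLIP path: the image encoder consumes the latent directly."""

    def __init__(self, latent_image_encoder: nn.Module):
        super().__init__()
        self.latent_image_encoder = latent_image_encoder

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        img_emb = self.latent_image_encoder(latents)  # no VAE decode needed
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        return (img_emb * text_emb).sum(dim=-1)


if __name__ == "__main__":
    # Toy usage with a stand-in encoder; real encoders would be trained models.
    latent_dim, emb_dim = 4 * 64 * 64, 512
    stub_encoder = nn.Sequential(nn.Flatten(), nn.Linear(latent_dim, emb_dim))
    scorer = LatentCLIPScorer(stub_encoder)
    latents = torch.randn(2, 4, 64, 64)              # batch of SD-style latents
    text_emb = torch.randn(2, emb_dim)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    print(scorer(latents, text_emb).shape)           # torch.Size([2])
```

Because the latent path skips the decoder, the same kind of scorer can also be applied to intermediate latents during sampling, which is what makes decode-free reward optimization and safety guidance possible.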