Synthesizing high-fidelity complex images from text is challenging. Large pretrained autoregressive and diffusion models can synthesize photo-realistic images, but despite their notable progress, three flaws remain. 1) They require enormous amounts of training data and parameters to perform well. 2) Their multi-step generation design makes image synthesis slow. 3) The synthesized visual features are difficult to control and require carefully designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs (GALIP). GALIP leverages the powerful pretrained CLIP model in both the discriminator and the generator. Specifically, we propose a CLIP-based discriminator: CLIP's ability to understand complex scenes enables the discriminator to assess image quality accurately. Furthermore, we propose a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency; as a result, our model requires only about 3% of the training data and 6% of the learnable parameters while achieving results comparable to large pretrained autoregressive and diffusion models. Moreover, our model synthesizes images 120 times faster and inherits the smooth latent space of GANs. Extensive experimental results demonstrate the excellent performance of GALIP. Code is available at https://github.com/tobran/GALIP.