The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture for designing generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.