The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture for designing generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, a 16-megapixel image in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.
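The three latent space editing operations named above can be illustrated generically. The following is a minimal sketch on random arrays, not GigaGAN's code: it assumes a StyleGAN-like layout with one style vector per generator layer, and the function names, `num_layers`, and `dim` are hypothetical choices made for illustration.

```python
import numpy as np

def lerp(w_a, w_b, t):
    """Latent interpolation: linear blend of two codes, t in [0, 1]."""
    return (1.0 - t) * w_a + t * w_b

def style_mix(ws_a, ws_b, crossover):
    """Style mixing: layers before `crossover` come from ws_a, the rest from ws_b."""
    mixed = ws_a.copy()
    mixed[crossover:] = ws_b[crossover:]
    return mixed

def apply_direction(w, w_with, w_without, strength=1.0):
    """Vector arithmetic: shift w along the direction (w_with - w_without)."""
    return w + strength * (w_with - w_without)

# Hypothetical shapes: 8 generator layers, 16-dimensional style vectors.
rng = np.random.default_rng(0)
num_layers, dim = 8, 16
ws_a = rng.standard_normal((num_layers, dim))
ws_b = rng.standard_normal((num_layers, dim))

midpoint = lerp(ws_a, ws_b, 0.5)                  # halfway between the two codes
mixed = style_mix(ws_a, ws_b, crossover=4)        # coarse layers from a, fine layers from b
edited = apply_direction(ws_a, ws_b, ws_a, 1.0)   # a + (b - a), which recovers b
```

In a real model, each resulting array would be passed to the generator to render an image; here the arrays only stand in for latent codes.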