Although the capacity of deep generative models for image generation, such as Diffusion Models (DMs) and Generative Adversarial Networks (GANs), has dramatically improved in recent years, much of their success can be attributed to computationally expensive architectures. This has limited their adoption to research laboratories and companies with large resources, while significantly raising the carbon footprint of training, fine-tuning, and inference. In this work, we present LadaGAN, an efficient generative adversarial network built upon a novel Transformer block named Ladaformer. The main component of this block is a linear additive-attention mechanism that computes a single attention vector per head instead of quadratic dot-product attention. We employ Ladaformer in both the generator and the discriminator, which reduces the computational complexity and overcomes the training instabilities often associated with Transformer GANs. LadaGAN consistently outperforms existing convolutional and Transformer GANs on benchmark datasets at different resolutions while being significantly more efficient. Moreover, LadaGAN shows competitive performance compared to state-of-the-art multi-step generative models (e.g., DMs) while using orders of magnitude fewer computational resources.
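To make the complexity claim concrete, the sketch below contrasts a single-head additive-attention step, which pools the sequence into one global attention vector, with standard dot-product attention. This is a minimal illustrative sketch of the general additive-attention idea (one scalar score per token, softmax over the sequence, a single weighted-sum vector), not the paper's Ladaformer implementation; the function name, the scoring vector `w`, and the final element-wise mixing step are all assumptions for illustration.

```python
import numpy as np

def additive_attention(x, w):
    """Illustrative linear additive attention (single head).

    x: (n, d) token embeddings; w: (d,) learned scoring vector.
    Cost is O(n * d), versus O(n^2 * d) for dot-product attention,
    because each token gets one scalar score instead of n scores.
    """
    scores = x @ w / np.sqrt(x.shape[-1])   # (n,) one scalar per token
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()             # softmax over the sequence
    g = alpha @ x                           # (d,) single global attention vector
    return g * x                            # broadcast global context to every token

def dot_product_attention(x):
    """Standard self-attention pattern for comparison: the (n, n) score
    matrix is what makes the cost quadratic in sequence length."""
    scores = x @ x.T / np.sqrt(x.shape[-1])          # (n, n)
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha = alpha / alpha.sum(-1, keepdims=True)
    return alpha @ x                                  # (n, d)
```

For a sequence of length n, the additive path materializes only length-n and length-d arrays, which is what allows attention at higher resolutions (longer token sequences) without the quadratic memory blow-up.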