Transformers have become prevalent in computer vision, especially for high-level vision tasks. However, adopting Transformers in the generative adversarial network (GAN) framework remains an open and challenging problem. In this paper, we conduct a comprehensive empirical study of the properties of Transformers in GANs for high-fidelity image synthesis. Our analysis highlights and reaffirms the importance of feature locality in image generation, even though the merits of locality are already well known in classification. Perhaps more interestingly, we find that the residual connections in self-attention layers are harmful for learning Transformer-based discriminators and conditional generators. We carefully examine their influence and propose effective ways to mitigate the negative impact. Our study leads to a new design of Transformers in GANs: a convolutional neural network (CNN)-free generator, termed STrans-G, which achieves competitive results in both unconditional and conditional image generation. The Transformer-based discriminator, STrans-D, also significantly narrows the gap to CNN-based discriminators.
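To make the two architectural ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a single-head self-attention layer with an optional locality window and an optional residual connection. It is an illustration only, not the STrans-G or STrans-D implementation: the function name `self_attention`, the `window` and `use_residual` parameters, and the 1D token window (an image model would normally use 2D windows) are all assumptions introduced here. The abstract does not describe how the authors mitigate the residual-connection issue, so the flag below only shows where that connection sits.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv, window=None, use_residual=True):
    """Single-head self-attention over tokens x of shape (n, d).

    window (hypothetical parameter): if set, token i may only attend to
    tokens j with |i - j| <= window, a simple 1D stand-in for locality.
    use_residual (hypothetical parameter): if True, add the input back
    to the attention output (the usual skip connection).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if window is not None:
        idx = np.arange(x.shape[0])
        too_far = np.abs(idx[:, None] - idx[None, :]) > window
        scores = np.where(too_far, -np.inf, scores)
    out = softmax(scores) @ v
    return x + out if use_residual else out


# Toy usage: 6 tokens with 4 features each, random projection weights.
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

with_skip = self_attention(x, wq, wk, wv, window=1, use_residual=True)
no_skip = self_attention(x, wq, wk, wv, window=1, use_residual=False)
```

With `window=1`, each token mixes information only from itself and its immediate neighbours, and the two outputs differ by exactly the input `x`, which is the contribution of the residual path.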