Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for latent and pixel mapping layers to facilitate convergence. Empirically, our approach, named ViTGAN, achieves performance comparable to the leading CNN-based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom.