Generative Adversarial Networks (GANs) have been very successful at synthesizing images that resemble those in a given dataset, and the images they generate are highly realistic. GANs have shown potential in several computer vision applications, including image generation, image-to-image translation, and video synthesis. Conventionally, the generator network is the backbone of a GAN and produces the samples, while the discriminator network facilitates the training of the generator. The discriminator is usually a Convolutional Neural Network (CNN), whereas the generator is usually either an up-sampling CNN (Up-CNN) for image generation or an encoder-decoder network for image-to-image translation. Convolution-based networks exploit local relationships within a layer, so they need many layers to extract abstract features. As a result, CNNs struggle to exploit global relationships in the feature space. Recently developed Transformer networks, by contrast, can exploit global relationships at every layer, and they have delivered substantial performance improvements on several computer vision problems. Motivated by the success of Transformer networks and GANs, recent works have tried to use Transformers within the GAN framework for image and video synthesis. This paper presents a comprehensive survey of the developments and advancements in GANs that use Transformer networks for computer vision applications. It also compares and analyzes performance for several applications on benchmark datasets. The survey is intended to help the deep learning and computer vision community understand the research trends and gaps related to Transformer-based GANs and develop advanced GAN architectures that exploit both global and local relationships for different applications.