Image generation has been effectively modeled as a problem of structure priors or transformation. However, existing models struggle to capture the global structure of the input image because of inherent properties such as the local inductive prior. Recent studies have shown that self-attention is an efficient modeling technique for image completion. In this paper, we propose a new architecture built on a Distance-based Weighted Transformer (DWT) to better model the relationships between an image's components. Our model leverages the strengths of both Convolutional Neural Networks (CNNs) and DWT blocks to enhance image completion. Specifically, CNNs augment the local texture information of coarse priors, while DWT blocks recover certain coarse textures and coherent visual structures. Unlike current approaches, which generally use CNNs to create feature maps, we use the DWT to encode global dependencies and compute distance-based weighted feature maps, which substantially reduces visual ambiguities. To better produce repeated textures, we introduce Residual Fast Fourier Convolution (Res-FFC) blocks that combine the encoder's skip features with the coarse features provided by our generator. Furthermore, we propose a simple yet effective technique that normalizes the non-zero values of convolutions and fine-tunes the network layers to regularize the gradient norms, providing an efficient training stabiliser. Extensive quantitative and qualitative experiments on three challenging datasets demonstrate the superiority of our model over existing approaches.
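The abstract does not give the exact formulation of the distance-based weighting, so the following is only an illustrative sketch of one way self-attention over image patches could be combined with a spatial distance term. The function names, the Gaussian-style penalty, and the `sigma` parameter are assumptions made for this example and should not be read as the DWT block itself.

```python
import numpy as np

def patch_coords(h, w):
    """Return an (h*w, 2) array of (row, col) positions for a grid of patches."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)

def distance_weighted_attention(x, coords, wq, wk, wv, sigma=2.0):
    """Single-head self-attention with a hypothetical distance penalty.

    x:      (n, d) patch features
    coords: (n, 2) patch positions on the image grid
    wq, wk, wv: (d, d_head) projection matrices
    sigma:  assumed bandwidth; larger values weaken the distance term
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Standard scaled dot-product logits between all pairs of patches.
    logits = (q @ k.T) / np.sqrt(q.shape[-1])
    # Squared Euclidean distance between every pair of patch positions.
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Assumption: distance enters as an additive penalty on the logits,
    # which is equivalent to a Gaussian weight on the attention scores.
    logits = logits - sq_dist / (2.0 * sigma ** 2)
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
coords = patch_coords(4, 4)                 # 16 patches on a 4x4 grid
x = rng.standard_normal((16, 8))            # toy patch features
wq, wk, wv = (0.1 * rng.standard_normal((8, 8)) for _ in range(3))
out, weights = distance_weighted_attention(x, coords, wq, wk, wv)
```

In this sketch every patch still attends to every other patch, so global dependencies remain available, while `sigma` controls how strongly nearby patches are favored.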