Learning to Scale Temperature in Masked Self-Attention for Image Inpainting

Recent advances in deep generative adversarial networks (GANs) and self-attention mechanisms have led to significant improvements in the challenging task of inpainting large missing regions in an image. These methods integrate self-attention into neural networks so that each position can draw on surrounding neural elements according to their correlation, which helps the networks capture long-range dependencies. Temperature is a parameter of the Softmax function used in self-attention, and it makes it possible to bias the distribution of attention scores towards a handful of similar patches. Most existing self-attention mechanisms in image inpainting are convolution-based and set the temperature to a constant, performing patch matching in a limited feature space. In this work, we analyze the artifacts and training problems of previous self-attention mechanisms, and redesign both the temperature learning network and the self-attention mechanism to address them. We present an image inpainting framework with a multi-head temperature masked self-attention mechanism, which provides stable and efficient temperature learning and draws on contextual information from multiple distant regions for high-quality image inpainting. In addition to improving the image quality of inpainting results, we generalize the proposed model to user-guided image editing by introducing a new sketch generation method. Extensive experiments on datasets including Paris StreetView, CelebA-HQ and Places2 demonstrate that our method not only generates more natural inpainting results than previous works, in terms of both perceptual image quality and quantitative metrics, but also helps users generate more flexible results that follow their sketch guidance.
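To make the role of the temperature concrete, the following is a minimal NumPy sketch of temperature-scaled attention restricted to known patches. It is an illustration under stated assumptions, not the framework proposed here: the function name `masked_attention`, the use of cosine similarity, the single attention head, the fixed scalar temperature, and the convention of dividing the scores by the temperature are all assumptions made for the example. In the proposed framework the temperature is learned and the attention is multi-head.

```python
import numpy as np

def masked_attention(queries, keys, values, valid, temperature=1.0):
    """Temperature-scaled attention over valid (known) patches only.

    Hypothetical helper for illustration.
    queries: (n_q, d) patch features that need to be filled in
    keys:    (n_k, d) candidate patch features
    values:  (n_k, d_v) content copied from the candidate patches
    valid:   (n_k,) boolean, True where the candidate lies outside the hole
             (assumes at least one entry is True)
    temperature: positive scalar; assumed convention is softmax(score / T),
             so a smaller T concentrates attention on fewer patches
    """
    # Cosine similarity between every query patch and every key patch.
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    scores = (q @ k.T) / temperature

    # Patches inside the hole get -inf, hence zero attention weight.
    scores = np.where(valid[None, :], scores, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each output row is a weighted mix of the valid value patches.
    return weights @ values, weights
```

Calling the function twice on the same inputs, once with `temperature=1.0` and once with `temperature=0.1`, shows the effect described above: under the assumed convention, the smaller temperature pushes almost all of the attention weight onto the most similar valid patch, while masked patches receive zero weight in both cases.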
