Benefiting from powerful convolutional neural networks (CNNs), learning-based image inpainting methods have made significant breakthroughs in recent years. However, certain properties of CNNs (e.g., local priors and spatially shared parameters) limit their performance on corrupted images with diverse and complex forms. Recently, a class of attention-based network architectures, known as transformers, has shown strong performance in natural language processing and high-level vision tasks. Compared with CNNs, attention operators are better at long-range modeling and have dynamic weights, but their computational complexity is quadratic in spatial resolution, which makes them less suitable for applications involving higher-resolution images, such as image inpainting. In this paper, we design a novel attention mechanism, derived from a Taylor expansion, whose computational cost is linear in the resolution. Based on this attention, we design a network called $T$-former for image inpainting. Experiments on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art accuracy while keeping the number of parameters and the computational complexity relatively low. The code can be found at \href{https://github.com/dengyecode/T-former_image_inpainting}{github.com/dengyecode/T-former\_image\_inpainting}.
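To illustrate how a Taylor expansion can make attention linear in resolution, the sketch below replaces $\exp(q \cdot k)$ with its first-order approximation $1 + q \cdot k$ and reorders the matrix products so that no $n \times n$ attention map is formed. It is a minimal illustration under stated assumptions, not the $T$-former implementation: the function names, the L2 normalization of queries and keys (used here to keep the weights non-negative), and the single-head `(n, d)` layout are all choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-6):
    # Scale each row to (just under) unit length so that |q . k| < 1
    # and every weight 1 + q . k stays strictly positive.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def taylor_linear_attention(q, k, v):
    """First-order Taylor attention in O(n) time and memory.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Hypothetical helper for illustration only.
    """
    q = l2_normalize(q)
    k = l2_normalize(k)
    n = k.shape[0]

    kv = k.T @ v           # (d, d_v), computed once for all queries
    v_sum = v.sum(axis=0)  # (d_v,), contribution of the constant term 1
    k_sum = k.sum(axis=0)  # (d,), used for the normalizer

    numer = v_sum + q @ kv  # (n, d_v): sum_j (1 + q_i . k_j) v_j
    denom = n + q @ k_sum   # (n,):     sum_j (1 + q_i . k_j)
    return numer / denom[:, None]


def taylor_quadratic_attention(q, k, v):
    """Same weights, but with the explicit (n, n) map, for comparison."""
    q = l2_normalize(q)
    k = l2_normalize(k)
    w = 1.0 + q @ k.T  # (n, n) attention weights
    return (w @ v) / w.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 1024, 16, 32  # n plays the role of H * W for a feature map
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d_v))
    out_linear = taylor_linear_attention(q, k, v)
    out_quadratic = taylor_quadratic_attention(q, k, v)
    print(np.allclose(out_linear, out_quadratic))
```

The two functions compute the same result; the linear variant only rearranges the sums, so its cost grows with $n \cdot d \cdot d_v$ rather than $n^2$.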