Recently, the Transformer architecture has been introduced into image restoration to replace the convolutional neural network (CNN), with surprising results. Because Transformers with global attention have high computational complexity, some methods restrict self-attention to local square windows. However, these methods lack direct interaction among different windows, which limits the establishment of long-range dependencies. To address this issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT). The core of CAT is the Rectangle-Window Self-Attention (Rwin-SA), which applies horizontal and vertical rectangle window attention in different heads in parallel to expand the attention area and aggregate features across different windows. We also introduce the Axial-Shift operation to enable interaction between different windows. Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism; it incorporates the inductive bias of CNNs (e.g., translation invariance and locality) into the Transformer, enabling global-local coupling. Extensive experiments demonstrate that CAT outperforms recent state-of-the-art methods on several image restoration applications. The code and models are available at https://github.com/zhengchen1999/CAT.
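The idea of running horizontal and vertical rectangle windows in different heads can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names, the window sizes (2 and 4), the split of the channels into exactly two heads, and the use of the input itself as query, key, and value (no learned projections) are all assumptions made for illustration. The Axial-Shift operation and the Locality Complementary Module are omitted.

```python
import numpy as np

def window_partition(x, sh, sw):
    """Split an (H, W, C) map into non-overlapping sh x sw windows.
    Returns an array of shape (num_windows, sh * sw, C)."""
    H, W, C = x.shape
    assert H % sh == 0 and W % sw == 0, "map must be divisible by the window"
    x = x.reshape(H // sh, sh, W // sw, sw, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes first
    return x.reshape(-1, sh * sw, C)

def window_merge(windows, sh, sw, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // sh, W // sw, sh, sw, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, sh, sw):
    """Self-attention restricted to each sh x sw window.
    Assumption: Q = K = V = x (no learned projections)."""
    H, W, C = x.shape
    tokens = window_partition(x, sh, sw)              # (N, L, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores) @ tokens                    # (N, L, C)
    return window_merge(out, sh, sw, H, W)

def rwin_attention_sketch(x, short=2, long=4):
    """Hypothetical two-head version: the first half of the channels attends
    within horizontal (short x long) windows, the second half within
    vertical (long x short) windows; the results are concatenated."""
    half = x.shape[-1] // 2
    horiz = window_self_attention(x[..., :half], short, long)
    vert = window_self_attention(x[..., half:], long, short)
    return np.concatenate([horiz, vert], axis=-1)
```

In this sketch, a pixel at row 0, column 3 influences the output at the top-left pixel only through the horizontal head, whose window spans columns 0 to 3; the vertical head covers rows 0 to 3 but only columns 0 to 1. Concatenating both heads is what gives each position a cross-shaped combined attention area.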