Transformers have recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks apply self-attention along either the spatial or the channel dimension and achieve impressive performance. This inspires us to combine the two dimensions in a single Transformer for more powerful representation capability. Based on this idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. DAT aggregates features across the spatial and channel dimensions in a dual manner, both inter-block and intra-block. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. This alternating strategy enables DAT to capture global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements the two self-attention mechanisms from their corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information into the feed-forward network. Extensive experiments show that DAT surpasses current methods. Code and models are available at https://github.com/zhengchen1999/DAT.
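To make the alternating strategy concrete, the following is a minimal sketch of spatial and channel self-attention applied in consecutive blocks. It is not the DAT implementation: it assumes unprojected attention (queries, keys, and values are all the input itself), a single head, and no windowing or normalization, and it omits AIM and SGFN entirely. All function names are hypothetical and chosen only for illustration. The point it shows is that spatial attention builds an N x N map over positions, while channel attention builds a C x C map over channels.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Unprojected self-attention: Q = K = V = tokens (simplifying assumption)."""
    dim = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, tokens), weights


def spatial_attention(x):
    """x is N positions by C channels; each position is a token (N x N map)."""
    return self_attention(x)


def channel_attention(x):
    """Each channel is a token, so attention runs on the transpose (C x C map)."""
    out_t, weights = self_attention(transpose(x))
    return transpose(out_t), weights


def alternating_blocks(x, depth):
    """Apply spatial and channel attention in turn, each with a residual add."""
    map_shapes = []
    for i in range(depth):
        attend = spatial_attention if i % 2 == 0 else channel_attention
        out, weights = attend(x)
        x = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, out)]
        map_shapes.append((len(weights), len(weights[0])))
    return x, map_shapes


# Toy input: 4 spatial positions, 3 channels.
x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
features, map_shapes = alternating_blocks(x, depth=4)
print(map_shapes)  # [(4, 4), (3, 3), (4, 4), (3, 3)]
```

The feature shape is unchanged by every block, which is what allows the two attention types to be interleaved freely. A practical model would add learned projections, multiple heads, and the intra-block modules described in the abstract.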