While some studies have shown that Swin Transformer (Swin) with window self-attention (WSA) is suitable for single image super-resolution (SR), plain WSA ignores broad regions when reconstructing high-resolution images because of its limited receptive field. In addition, many deep learning SR methods are computationally intensive. To address these problems, we introduce N-Gram context to low-level vision with Transformers for the first time. We define an N-Gram as a set of neighboring local windows in Swin, in contrast to text analysis, where an N-Gram is a sequence of consecutive characters or words. N-Grams interact with each other through sliding-WSA, which expands the region seen when restoring degraded pixels. Using the N-Gram context, we propose NGswin, an efficient SR network with an SCDP bottleneck that takes the multi-scale outputs of the hierarchical encoder. Experimental results show that NGswin achieves competitive performance while maintaining an efficient structure compared with previous leading methods. Moreover, we improve other Swin-based SR methods with the N-Gram context, building an enhanced model, SwinIR-NG. The improved SwinIR-NG outperforms the current best lightweight SR approaches and establishes state-of-the-art results. Code is available at https://github.com/rami0205/NGramSwin.
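To make the idea of an N-Gram as a group of neighboring local windows concrete, the following is a minimal NumPy sketch and not the released NGswin implementation. It assumes an (H, W, C) feature map, reduces each window to a single embedding by averaging, and replaces sliding-WSA with a plain average over each n x n neighborhood of windows. The names `window_partition` and `ngram_context`, the forward (bottom/right) zero padding, and the averaging step are all illustrative assumptions.

```python
import numpy as np


def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping local windows.

    Returns an array of shape (H // window, W // window, window, window, C).
    """
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "H and W must be multiples of window"
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4)


def ngram_context(x, window, n):
    """Hypothetical helper: one context vector per window from its n x n neighbors.

    Assumption: plain averaging stands in for the sliding-WSA interaction
    described in the abstract; this only illustrates the grouping of windows.
    """
    wins = window_partition(x, window)
    # One embedding per local window (the "uni-gram"), shape (nh, nw, C).
    uni = wins.mean(axis=(2, 3))
    nh, nw, _ = uni.shape
    # Zero-pad bottom/right so every window has a full n x n neighborhood.
    # Padded entries are included in the average, so border contexts are diluted.
    padded = np.pad(uni, ((0, n - 1), (0, n - 1), (0, 0)))
    ctx = np.zeros_like(uni)
    for i in range(nh):
        for j in range(nw):
            # Slide an n x n group over the grid of window embeddings.
            ctx[i, j] = padded[i:i + n, j:j + n].mean(axis=(0, 1))
    return ctx


if __name__ == "__main__":
    feat = np.arange(16, dtype=float).reshape(4, 4, 1)
    print(ngram_context(feat, window=2, n=2)[..., 0])
```

With `n=1` each window sees only itself, which corresponds to the limited receptive field of plain WSA; a larger `n` lets every window draw on its neighbors, which is the effect the N-Gram context is meant to provide.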