Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, their lack of a locality mechanism and their high complexity limit their application to super-resolution (SR). To address these problems, we propose the Efficient Mixed Transformer (EMT). Specifically, we propose the Mixed Transformer Block (MTB), which consists of multiple consecutive transformer layers; in some of these layers, the Pixel Mixer (PM) replaces Self-Attention (SA). PM enhances local knowledge aggregation through pixel-shifting operations, and it adds no complexity because it has neither parameters nor floating-point operations. Moreover, we employ striped-window SA (SWSA), which exploits image anisotropy to model global dependencies efficiently. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance. The code is available at https://github.com/Fried-Rice-Lab/EMT.git.
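To illustrate how a parameter-free, pixel-shifting operation can mix local information, the following is a minimal sketch in NumPy. It is not taken from the EMT code base: the function name `pixel_mixer`, the split into five channel groups, the one-pixel shift distance, and the circular (wrap-around) boundary handling are all assumptions made for illustration. The only properties it shares with the PM described above are that it shifts pixels, has no learnable parameters, and performs no arithmetic on the feature values.

```python
import numpy as np


def pixel_mixer(x: np.ndarray) -> np.ndarray:
    """Hypothetical pixel-shift mixer for a feature map of shape (C, H, W).

    Assumption: channels are split into five groups; four groups are shifted
    by one pixel (right, left, down, up) and the fifth is left unchanged.
    Assumption: shifts wrap around at the border (circular shift via np.roll).
    """
    # Split along the channel axis; group sizes may differ by one channel.
    groups = np.array_split(x, 5, axis=0)

    # (dy, dx) shift applied to each group along the (H, W) axes.
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

    # np.roll only moves data; it has no weights and does no arithmetic
    # on the feature values themselves.
    shifted = [
        np.roll(g, shift=(dy, dx), axis=(1, 2))
        for g, (dy, dx) in zip(groups, shifts)
    ]
    return np.concatenate(shifted, axis=0)


if __name__ == "__main__":
    features = np.arange(5 * 3 * 3, dtype=np.float32).reshape(5, 3, 3)
    mixed = pixel_mixer(features)
    print(mixed.shape)
```

After such a shift, each spatial position holds channels that originated at its four neighbours as well as at itself, so a subsequent pointwise layer can combine neighbouring pixels without any spatial convolution. Whether EMT uses this exact grouping or boundary handling should be checked against the released code.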