Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, the lack of a locality mechanism and high computational complexity limit their application to super-resolution (SR). To address these problems, we propose a new method, the Efficient Mixed Transformer (EMT). Specifically, we propose the Mixed Transformer Block (MTB), which consists of multiple consecutive transformer layers, in some of which the Pixel Mixer (PM) replaces Self-Attention (SA). PM enhances local knowledge aggregation through pixel-shifting operations. At the same time, it introduces no additional complexity, since PM has no parameters and requires no floating-point operations. Moreover, we employ a striped window for SA (SWSA) to model global dependencies efficiently by exploiting image anisotropy. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance. The code is available at https://github.com/Fried-Rice-Lab/FriedRiceLab.
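To make the claim that PM has no parameters and no floating-point operations concrete, the following is a minimal sketch of a shift-based mixer. The abstract only states that PM uses pixel-shifting operations, so every detail here is an assumption made for illustration: the function name `pixel_mixer`, the split into five channel groups (four shifted, one left unchanged), the circular wrap-around at the borders, and the one-pixel shift distance. It is not taken from the released code.

```python
import numpy as np


def pixel_mixer(x: np.ndarray, shift: int = 1) -> np.ndarray:
    """Hypothetical shift-based mixer for a feature map of shape (C, H, W).

    Assumption: channels are split into five equal groups. Four groups are
    shifted circularly (left, right, up, down) and the remaining channels
    are copied unchanged. Only data movement is performed, so there are no
    learnable parameters and no arithmetic on the feature values.
    """
    c = x.shape[0]
    g = c // 5  # channels per group (assumed split)
    out = x.copy()  # channels from 4*g onward stay as they are

    out[0 * g:1 * g] = np.roll(x[0 * g:1 * g], -shift, axis=2)  # shift left
    out[1 * g:2 * g] = np.roll(x[1 * g:2 * g], shift, axis=2)   # shift right
    out[2 * g:3 * g] = np.roll(x[2 * g:3 * g], -shift, axis=1)  # shift up
    out[3 * g:4 * g] = np.roll(x[3 * g:4 * g], shift, axis=1)   # shift down
    return out


if __name__ == "__main__":
    feats = np.arange(5 * 3 * 3, dtype=np.float32).reshape(5, 3, 3)
    mixed = pixel_mixer(feats)
    print(mixed.shape)
```

Under these assumptions, each spatial position in the output sees values from its four neighbours across different channel groups, so a following pointwise layer can aggregate local context even though the mixer itself only rearranges values.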