MaxSR: Image Super-Resolution Using Improved MaxViT

While transformer models have proven effective for natural language processing and high-level vision tasks, only a few attempts have been made to apply them to single image super-resolution. Transformer models have powerful representation capacity, and their built-in self-attention mechanism helps exploit the self-similarity prior in the input low-resolution image to improve super-resolution performance. Motivated by this, we present MaxSR, a single image super-resolution model based on MaxViT, a recent hybrid vision transformer. MaxSR consists of four parts: a shallow feature extraction block, multiple cascaded adaptive MaxViT blocks that extract deep hierarchical features and efficiently model global self-similarity from low-level features, a hierarchical feature fusion block, and a reconstruction block. The key component of MaxSR, the adaptive MaxViT block, is based on the MaxViT block, which combines MBConv with squeeze-and-excitation, block attention, and grid attention. To better model global self-similarity in the input low-resolution image, we replace the block attention and grid attention of the MaxViT block with adaptive block attention and adaptive grid attention, which perform self-attention inside each window across all grids and inside each grid across all windows, respectively, in the most efficient way. We instantiate the proposed model for classical single image super-resolution (MaxSR) and lightweight single image super-resolution (MaxSR-light). Experiments show that MaxSR and MaxSR-light establish new state-of-the-art performance efficiently.
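As a rough illustration of the two attention patterns named above, the following sketch groups pixel coordinates the way block attention and grid attention are usually described for the original MaxViT block: block attention attends within non-overlapping local windows, and grid attention attends across a sparse, strided set of pixels spanning the whole image. It does not implement the adaptive variants proposed here, and it computes no attention weights. The function names `block_groups` and `grid_groups` and their parameters are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def block_groups(height, width, window):
    """Group pixels into non-overlapping window x window blocks (local attention).

    Hypothetical helper: each returned list holds the (row, col) positions
    that would attend to one another under block attention.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    groups = defaultdict(list)
    for r in range(height):
        for c in range(width):
            # Pixels sharing a block index fall in the same local window.
            groups[(r // window, c // window)].append((r, c))
    return list(groups.values())


def grid_groups(height, width, grid):
    """Group pixels into grid x grid strided sets (sparse global attention).

    Hypothetical helper: each returned list holds grid * grid positions
    spread uniformly over the image, one from each cell of the grid.
    """
    if height % grid or width % grid:
        raise ValueError("height and width must be divisible by grid")
    stride_r, stride_c = height // grid, width // grid
    groups = defaultdict(list)
    for r in range(height):
        for c in range(width):
            # Pixels sharing an offset inside their cell attend to each other.
            groups[(r % stride_r, c % stride_c)].append((r, c))
    return list(groups.values())


if __name__ == "__main__":
    # 4 x 4 feature map, 2 x 2 windows and a 2 x 2 grid.
    print(block_groups(4, 4, 2)[0])
    print(grid_groups(4, 4, 2)[0])
```

On a 4 x 4 map, the first block group covers the top-left 2 x 2 window, while the first grid group picks one pixel from each quadrant, which is how grid attention mixes information globally at the same cost as a local window.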
