Transformer-based architectures have begun to emerge in single image super-resolution (SISR) and have achieved promising performance. Most existing Vision Transformers divide every image into the same number of fixed-size patches, which may not be optimal for restoring patches with different levels of texture richness. This paper presents HIPA, a novel Transformer architecture that progressively recovers the high-resolution image using a hierarchical patch partition. Specifically, we build a cascaded model that processes an input image in multiple stages, starting with tokens of small patch size and gradually merging them up to the full resolution. Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions, e.g., using a smaller patch for areas with fine details and a larger patch for textureless regions. In addition, we propose a new attention-based position encoding scheme for Transformers that lets the network focus on the more important tokens by assigning different weights to different tokens; to the best of our knowledge, this is the first such scheme. Furthermore, we propose a new multi-receptive-field attention module that enlarges the convolutional receptive field through different branches. Experimental results on several public datasets demonstrate that the proposed HIPA outperforms previous methods both quantitatively and qualitatively.
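To make the idea of a hierarchical patch partition concrete, the following is a minimal sketch in plain Python, not the HIPA implementation. It splits a toy 8x8 single-channel grid into non-overlapping patches at three patch sizes and reports how many tokens each stage produces. The function names, the toy image, and the patch-size schedule (2, 4, 8) are all assumptions made for illustration; the abstract does not specify them.

```python
def partition_into_patches(image, patch_size):
    """Split a 2-D grid (list of rows) into non-overlapping square patches.

    Returns a list of tokens in row-major order; each token is the
    flattened list of values inside one patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return tokens


def hierarchical_token_counts(image, patch_sizes):
    """Return (patch_size, num_tokens, token_length) for each stage."""
    summary = []
    for patch_size in patch_sizes:
        tokens = partition_into_patches(image, patch_size)
        summary.append((patch_size, len(tokens), len(tokens[0])))
    return summary


# Toy 8x8 "image" whose pixel value encodes its position (hypothetical data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]

# Assumed schedule: small patches first, then progressively larger ones.
stages = hierarchical_token_counts(image, patch_sizes=(2, 4, 8))
for patch_size, num_tokens, token_len in stages:
    print(f"patch {patch_size}x{patch_size}: "
          f"{num_tokens} tokens of length {token_len}")
```

Under these assumptions, small patches yield many short tokens (fine spatial granularity), while large patches yield few long tokens. This is the trade-off that a multi-stage design can exploit across regions with different texture richness.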