In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from the GAN literature. The model processes the two visual streams independently and fuses their self-attention and cross-attention blocks through a gating attention strategy. A squeeze-and-excitation module captures global context from the input images, facilitating long-range spatial interactions within the window-based attention blocks, while long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets: SUN80, Urban100, and Manga109. On the SUN80 dataset, it achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution: the transformer-based model attains state-of-the-art results without purpose-built subnetworks, knowledge distillation, or multi-stage training.
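To make the gated fusion of the two attention streams concrete, the following is a minimal PyTorch sketch, not the authors' HiTSR implementation: the module names `GatedDualAttention` and `SqueezeExcitation` are hypothetical, window partitioning and the full double attention block are omitted, and the gate design (a sigmoid over the concatenated self- and cross-attention outputs) is one plausible reading of the gating attention strategy described above.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-wise squeeze-and-excitation over a token sequence:
    pool over tokens, then rescale each channel (global context)."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, N, C)
        scale = self.fc(x.mean(dim=1))         # squeeze over tokens -> (B, C)
        return x * scale.unsqueeze(1)          # excite: per-channel rescale

class GatedDualAttention(nn.Module):
    """Hypothetical sketch: self-attention on the low-resolution stream,
    cross-attention from it to the reference stream, fused by a learned gate."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.se = SqueezeExcitation(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lr_tokens, ref_tokens):
        # Self-attention within the low-resolution stream.
        sa, _ = self.self_attn(lr_tokens, lr_tokens, lr_tokens)
        # Cross-attention: LR queries attend to reference keys/values,
        # importing matching correspondences from the HR reference.
        ca, _ = self.cross_attn(lr_tokens, ref_tokens, ref_tokens)
        # Learned gate decides, per token and channel, how much
        # reference information to mix into the fused features.
        g = self.gate(torch.cat([sa, ca], dim=-1))
        fused = g * ca + (1.0 - g) * sa
        # Global context via squeeze-and-excitation, plus a residual path.
        return self.norm(lr_tokens + self.se(fused))

# Usage: 64 tokens (e.g. one 8x8 window) with 64-dim features per stream.
lr = torch.randn(2, 64, 64)
ref = torch.randn(2, 64, 64)
out = GatedDualAttention(dim=64)(lr, ref)
print(out.shape)  # torch.Size([2, 64, 64])
```

The gate lets the network fall back to pure self-attention where the reference offers no useful correspondence, which is one way a single-stage model can absorb the role of the matching subnetworks used in multi-stage pipelines.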