Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

Remote sensing image super-resolution (RSISR) plays a vital role in enhancing spatial details and improving the quality of satellite imagery. Recently, Transformer-based models have shown competitive performance in RSISR. To mitigate the quadratic computational complexity of global self-attention, various methods constrain attention to a local window, which improves efficiency. However, this leaves the receptive field of a single attention layer too small, resulting in insufficient context modeling. Furthermore, while most Transformer-based approaches reuse shallow features through skip connections, relying solely on these connections treats shallow and deep features equally, impeding the model's ability to characterize them. To address these issues, we propose a novel Transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. The proposed model strengthens global perception and understanding of the entire image and facilitates efficient integration of features across stages. It incorporates cross-spatial pixel integration attention (CSPIA) to introduce contextual information into a local window, while cross-stage feature fusion attention (CSFFA) adaptively fuses features from the previous stage to improve feature expression in line with the requirements of the current stage. Comprehensive experiments on multiple benchmark datasets demonstrate that SPIFFNet outperforms state-of-the-art methods in both quantitative metrics and visual quality.
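
To make the efficiency argument concrete, the following minimal sketch counts query-key pairs for global self-attention versus attention restricted to non-overlapping local windows. It is only an illustration of the general trade-off, not the SPIFFNet implementation: the function names, the 64 x 64 feature map, and the window size of 8 are assumptions chosen for the example and do not come from the paper.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every pixel attends to every other pixel."""
    tokens = height * width
    return tokens * tokens  # quadratic in the number of pixels


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention is limited to non-overlapping
    window x window blocks (assumes the windows tile the map exactly)."""
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2  # linear in the number of pixels


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the paper).
    h, w, m = 64, 64, 8
    full = global_attention_pairs(h, w)
    local = window_attention_pairs(h, w, m)
    print(f"global: {full}, windowed: {local}, ratio: {full // local}")
```

The saving comes at a cost: within one windowed layer, a pixel only sees the other pixels of its own window, which is the limited receptive field that the abstract identifies as the motivation for CSPIA.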
