General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios such as surveillance and autonomous driving, which operate from fixed or near-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By integrating these tokens into the attention computation, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
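The dual-token fusion described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual implementation: it supposes the learnable prior tokens simply join the key/value set of a standard scaled dot-product attention, so that queries from local (frame-specific) tokens can attend to both the current frame and the persistent scene memory. All names, shapes, and the single-head formulation are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: N local tokens from the current frame,
# P learnable regional prior tokens (persistent scene memory), dim d.
rng = np.random.default_rng(0)
N, P, d = 16, 4, 8
local_tokens = rng.standard_normal((N, d))   # frame-specific content
prior_tokens = rng.standard_normal((P, d))   # learned scene-layout memory

# Queries come from local tokens; keys/values from the concatenation,
# so the priors can modulate the reconstruction at each local position.
kv = np.concatenate([local_tokens, prior_tokens], axis=0)   # (N + P, d)
attn = softmax(local_tokens @ kv.T / np.sqrt(d))            # (N, N + P)
out = attn @ kv                                             # (N, d)
```

In a trained model the prior tokens would be learned parameters shared across frames, while the local tokens are recomputed per input; projections (Q/K/V), multiple heads, and windowing are omitted here for brevity.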