Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, FMs have seen little application to hyperspectral images (HSIs), which are rich in spectral information; existing methods are often restricted to specific tasks and lack generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation that scales to over a billion parameters. To tackle the spectral and spatial redundancy in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training; it contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transfer capability, and real-world applicability.