Hyperspectral imagery contains abundant spectral information beyond the visible RGB bands, providing rich discriminative details about objects in a scene. Leveraging such data has the potential to enhance visual tracking performance. While prior hyperspectral trackers employ CNN or hybrid CNN-Transformer architectures, we propose HPFormer, a novel Transformer-based approach that capitalizes on the powerful representation learning capabilities of Transformers. The core of HPFormer is a Hyperspectral Hybrid Attention (HHA) module, which unifies feature extraction and fusion within a single component through token interactions. Additionally, a Transform Band Module (TBM) is introduced to selectively aggregate spatial details and spectral signatures from the full hyperspectral input in order to inject informative target representations. Extensive experiments demonstrate that HPFormer achieves state-of-the-art performance on benchmark NIR and VIS tracking datasets. Our work provides new insights into harnessing the strengths of Transformers and hyperspectral fusion to advance robust object tracking.
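To make the idea of unifying feature extraction and fusion through token interactions more concrete, the following is a minimal sketch of joint attention over concatenated template and search tokens. It is not the HHA module itself, whose internals the abstract does not specify: the function name `joint_attention`, the single attention head, and the identity query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(template_tokens, search_tokens):
    """Hypothetical single-head attention over concatenated token sets.

    Assumption: identity Q/K/V projections, so each token serves as its own
    query, key, and value. Because template and search tokens share one
    attention map, each token is refined by its own set (feature extraction)
    and by the other set (fusion) in the same operation.
    """
    tokens = template_tokens + search_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    split = len(template_tokens)
    return outputs[:split], outputs[split:]


# Toy example: one template token and two search tokens of dimension 2.
template_out, search_out = joint_attention([[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]])
```

In this sketch the template output depends on the search tokens and vice versa, which is the sense in which a single attention step can both extract and fuse features. A real tracker would add learned projections, multiple heads, and positional information.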