Hyperspectral object tracking (HOT) has shown potential in various applications, particularly in scenes where objects are camouflaged. Existing trackers can retrieve objects effectively via band regrouping because of a bias in existing HOT datasets, in which most objects are distinguished by their visual appearance rather than by their spectral characteristics. This bias allows a tracker to rely directly on visual features obtained from false-color images generated from the hyperspectral images, without extracting spectral features. To address this bias, we find that a tracker should focus on spectral information when object appearance is unreliable. We therefore introduce a new task, hyperspectral camouflaged object tracking (HCOT), and meticulously construct a large-scale HCOT dataset, termed BihoT, which consists of 41,912 hyperspectral images across 49 video sequences. The dataset covers various artificial camouflage scenes in which objects have similar appearances, diverse spectra, and frequent occlusion, making it highly challenging for HCOT. In addition, we propose a simple but effective baseline model, the spectral prompt-based distractor-aware network (SPDAN), which comprises a spectral embedding network (SEN), a spectral prompt-based backbone network (SPBN), and a distractor-aware module (DAM). Specifically, the SEN extracts spectral-spatial features via 3-D and 2-D convolutions. The SPBN then fine-tunes powerful RGB trackers with spectral prompts, alleviating the shortage of training samples. Moreover, the DAM uses a novel statistic to capture distractors caused by occlusion from objects and the background. Extensive experiments demonstrate that the proposed SPDAN achieves state-of-the-art performance on BihoT and on other HOT datasets.
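To make the band regrouping mentioned above concrete, the following is a minimal sketch of how a hyperspectral cube might be split into three-band false-color groups. It is an illustration only, not the procedure used by any particular tracker: the function name `regroup_bands`, the 16-band toy cube, the in-order grouping, and the padding of a short final group are all assumptions made for this example.

```python
from typing import List, Sequence

Image = List[List[float]]  # one spectral band: rows of pixel values


def regroup_bands(cube: Sequence[Image], group_size: int = 3) -> List[List[Image]]:
    """Split a hyperspectral cube (a sequence of bands) into false-color groups.

    Assumptions (hypothetical, for illustration): bands are grouped in their
    given order, and a final group that is too short is padded by repeating
    its last band.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups: List[List[Image]] = []
    for start in range(0, len(cube), group_size):
        group = list(cube[start:start + group_size])
        while len(group) < group_size:
            group.append(group[-1])  # pad the short trailing group
        groups.append(group)
    return groups


# Toy cube with an assumed 16 bands of 2x2 pixels; band b holds the value b.
cube = [[[float(b)] * 2 for _ in range(2)] for b in range(16)]
false_color = regroup_bands(cube)
print(len(false_color), [len(g) for g in false_color])
```

Each resulting group can be fed to an RGB tracker as if it were a color image, which is why a tracker built this way can succeed on appearance alone without modeling the full spectrum.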