RGB-T tracking uses images from both the visible and thermal modalities. The primary objective is to adaptively exploit whichever modality is dominant under the current conditions, achieving more robust tracking than single-modality methods. In this paper, we propose MACFT, an RGB-T tracker that uses a mixed attention mechanism to achieve complementary fusion of the two modalities. In the feature extraction stage, we use separate transformer backbone branches to extract modality-specific and modality-shared information. Mixed attention operations in the backbone enable information interaction and self-enhancement between the template and search images, yielding a robust feature representation that better captures the high-level semantic features of the target. Then, in the feature fusion stage, a mixed attention-based modality fusion network performs modality-adaptive fusion, suppressing noise from the low-quality modality while enhancing the information of the dominant modality. Evaluation on multiple public RGB-T datasets demonstrates that the proposed tracker outperforms other RGB-T trackers on general evaluation metrics while also adapting to long-term tracking scenarios.
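The abstract does not spell out how the mixed attention operation is computed, so the following is only a minimal sketch of one common reading of it: template and search tokens are concatenated, and every token attends over the joint set, so that attention to its own image acts as self-enhancement and attention to the other image acts as information interaction. The function names (`attend`, `mixed_attention`), the single attention head, and the use of the raw tokens as queries, keys, and values (no learned projections) are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def mixed_attention(template, search):
    """Hypothetical symmetric mixed attention.

    Assumption: both token sets attend over their concatenation, and the
    tokens themselves serve as queries, keys, and values.
    """
    mixed = template + search  # joint key/value set
    new_template = attend(template, mixed, mixed)
    new_search = attend(search, mixed, mixed)
    return new_template, new_search


# Toy tokens: 2 template tokens and 3 search tokens, each of dimension 2.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 1.0], [0.5, -0.5], [-1.0, 0.0]]
new_template, new_search = mixed_attention(template, search)
print(len(new_template), len(new_search))
```

The token counts are preserved, and each output token is a weighted average of all template and search tokens. Under the same assumptions, a fusion stage could apply the same operation to visible and thermal tokens instead of template and search tokens, although the actual fusion network described in the abstract may differ.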