Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Much RGBT tracking research focuses primarily on modal fusion design while overlooking the effective handling of target appearance changes. Although some approaches introduce historical frames, or fuse and replace initial templates, to incorporate temporal information, they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach that mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes for robust RGBT tracking. We introduce independent dynamic template tokens that interact with the search region, embedding temporal information to address appearance changes. At the same time, we keep the initial static template tokens involved in joint feature extraction, which preserves the original, reliable target appearance information and prevents the deviations from the target appearance that traditional temporal updates cause. We also use attention mechanisms to enhance the target features of the multimodal template tokens with supplementary modal cues, and let the multimodal search region tokens interact with the multimodal dynamic template tokens via attention, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the Transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared with other state-of-the-art tracking algorithms while running at 39.1 FPS.
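The token interactions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `cross_attention` and `mix_tokens`, the dictionary layout with `"rgb"` and `"tir"` keys, the single attention head, the absence of learned projections, and the order of the two steps are all assumptions made for illustration. It only shows the data flow: dynamic template tokens are enhanced with cues from the other modality, search region tokens then attend to the enhanced dynamic template tokens, and the static template tokens pass through unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with a residual connection.

    Assumption: no learned projections, so the tokens serve directly as
    queries, keys, and values. Tokens are lists of floats of equal length.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def mix_tokens(static_tpl, dynamic_tpl, search):
    """Hypothetical mixing step; each argument maps 'rgb'/'tir' to a token list."""
    other = {"rgb": "tir", "tir": "rgb"}
    # Step 1: enhance each modality's dynamic template tokens with the other modality.
    enhanced = {m: cross_attention(dynamic_tpl[m], dynamic_tpl[other[m]]) for m in other}
    # Step 2: search tokens attend to the multimodal dynamic template tokens.
    dynamic_all = enhanced["rgb"] + enhanced["tir"]
    # Step 3: static template tokens are kept untouched and placed before the
    # updated search tokens, ready for joint feature extraction.
    return {m: static_tpl[m] + cross_attention(search[m], dynamic_all) for m in other}


static_tpl = {"rgb": [[1.0, 0.0]], "tir": [[0.0, 1.0]]}
dynamic_tpl = {"rgb": [[0.5, 0.5]], "tir": [[0.2, 0.8]]}
search = {"rgb": [[1.0, 1.0], [0.0, 0.0]], "tir": [[0.3, 0.7], [0.9, 0.1]]}
mixed = mix_tokens(static_tpl, dynamic_tpl, search)
```

In this sketch the static template tokens are never overwritten, which mirrors the stated goal of preserving the original target appearance while temporal information enters only through the dynamic template tokens.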
