Driven by the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Because a single imaging sensor has inherent limitations, multi-modal images (RGB, infrared, etc.) have been introduced to compensate for them and enable all-weather object tracking in complex environments. However, sufficient multi-modal tracking data is hard to acquire, and the dominant modality changes with the open environment, so most existing techniques fail to extract complementary multi-modal information dynamically and yield unsatisfactory tracking performance. To address this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter, in which the modalities cross-prompt one another. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract the features of each modality separately using a frozen pre-trained foundation model. We develop a simple but effective lightweight feature adapter that transfers modality-specific information from one modality to another, performing visual feature prompt fusion adaptively. By adding only a small number (0.32M) of trainable parameters, our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods. Our code is available at https://github.com/SparkTempest/BAT.
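The cross-prompting idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a bottleneck adapter (down-projection followed by up-projection) whose weights are shared across both directions, a single frozen weight matrix standing in for a pre-trained encoder layer, and two modalities named RGB and TIR. All function names, dimensions, and the zero initialization are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 16         # token embedding size (hypothetical)
BOTTLENECK = 4   # adapter hidden size (hypothetical)
NUM_TOKENS = 5   # tokens per modality (hypothetical)


def make_frozen_layer(dim):
    """Stand-in for one frozen, pre-trained encoder layer shared by both branches."""
    weight = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    weight.setflags(write=False)  # frozen: these weights are never updated
    return weight


def make_adapter(dim, bottleneck):
    """Trainable bottleneck adapter (assumed structure): down- then up-projection."""
    return {
        "down": rng.standard_normal((dim, bottleneck)) * 0.02,
        "up": np.zeros((bottleneck, dim)),  # zero init: the adapter starts as a no-op
    }


def adapter_prompt(adapter, tokens):
    """Map one modality's features to a prompt for the other modality."""
    return tokens @ adapter["down"] @ adapter["up"]


def encode_layer(frozen_weight, adapter, rgb, tir):
    """One layer: shared frozen encoding, then bi-directional prompting."""
    rgb_feat = np.tanh(rgb @ frozen_weight)
    tir_feat = np.tanh(tir @ frozen_weight)
    # Each branch receives a prompt computed from the *other* modality,
    # using the same adapter weights in both directions.
    rgb_out = rgb_feat + adapter_prompt(adapter, tir_feat)
    tir_out = tir_feat + adapter_prompt(adapter, rgb_feat)
    return rgb_out, tir_out


def count_trainable(adapter):
    """Only the adapter weights count as trainable; the encoder stays frozen."""
    return sum(w.size for w in adapter.values())


frozen = make_frozen_layer(DIM)
adapter = make_adapter(DIM, BOTTLENECK)
rgb_tokens = rng.standard_normal((NUM_TOKENS, DIM))
tir_tokens = rng.standard_normal((NUM_TOKENS, DIM))
rgb_out, tir_out = encode_layer(frozen, adapter, rgb_tokens, tir_tokens)
print("trainable:", count_trainable(adapter), "frozen:", frozen.size)
```

In this toy setting the adapter holds fewer parameters than the frozen layer it augments, which mirrors, at a much smaller scale, the abstract's point that only the adapter needs to be trained. Zero-initializing the up-projection means the model initially behaves exactly like the frozen encoder, and the cross-modal prompts are learned from there.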