Multi-modal object tracking has attracted considerable attention because integrating multiple complementary inputs (e.g., thermal, depth, and event data) can yield outstanding performance. Although current general-purpose multi-modal trackers primarily unify the various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a Mamba-style state space model, termed UBATrack. UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba's long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.
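To make the adapter-tuning idea concrete, the following is a minimal, hypothetical sketch (not the authors' code): a bottleneck adapter whose inner mixing step is a simple diagonal state-space recurrence in the spirit of Mamba/S4, applied to a frozen backbone's token features with a residual connection. All names, dimensions, and parameter choices here are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of an adapter-style state-space block, assuming:
#  - a frozen backbone produces a sequence of token features (T, d_model)
#  - the adapter down-projects, runs a diagonal SSM scan
#    h_t = A * h_{t-1} + B * x_t,  y_t = C * h_t, then up-projects
#  - the result is added back as a residual (adapter tuning)
# Real Mamba uses input-dependent (selective) parameters and hardware-aware
# scans; this is only the simplest fixed-parameter version of the recurrence.

rng = np.random.default_rng(0)

def ssm_scan(x, A, B, C):
    """Sequential diagonal SSM scan. x: (T, d); A, B, C: (d,) vectors."""
    T, d = x.shape
    h = np.zeros(d)
    ys = np.empty((T, d))
    for t in range(T):
        h = A * h + B * x[t]   # state update (diagonal A for simplicity)
        ys[t] = C * h          # readout
    return ys

def mamba_adapter(tokens, W_down, W_up, A, B, C):
    """Bottleneck adapter: down-project, SSM scan, up-project, residual."""
    z = tokens @ W_down        # (T, d_model) -> (T, d_adapter)
    z = ssm_scan(z, A, B, C)   # sequence mixing over tokens
    return tokens + z @ W_up   # residual keeps the frozen backbone path intact

T, d_model, d_adapter = 8, 16, 4
tokens = rng.normal(size=(T, d_model))
W_down = rng.normal(size=(d_model, d_adapter)) * 0.1
W_up   = rng.normal(size=(d_adapter, d_model)) * 0.1
A = np.full(d_adapter, 0.9)    # decay close to 1 retains long-range memory
B = np.ones(d_adapter)
C = np.ones(d_adapter)

out = mamba_adapter(tokens, W_down, W_up, A, B, C)
print(out.shape)               # (8, 16): same shape as input, as a residual adapter requires
```

Because only the small adapter parameters (`W_down`, `W_up`, `A`, `B`, `C`) would be trained while the backbone stays frozen, this illustrates why such a design avoids full-parameter fine-tuning.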