Learning Progressive Adaptation for Multi-Modal Tracking

Because paired multi-modal data are scarce, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook more advanced ways of adapting RGB pre-trained models and fail to modulate the individual modalities, the cross-modal interactions, and the prediction head. To address these issues, we propose Progressive Adaptation for Multi-Modal Tracking (PATrack). PATrack combines modality-dependent, modality-entangled, and task-level adapters, which progressively bridge the gap between RGB pre-trained networks and multi-modal data. Specifically, the modality-dependent adapter enhances modality-specific information by decomposing it into high- and low-frequency components, yielding a more robust feature representation within each modality. The modality-entangled adapter introduces inter-modal interactions through a cross-attention operation guided by information shared between the modalities, which ensures that the features conveyed between modalities are reliable. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, we introduce a task-level adapter specific to the prediction head. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method performs favourably against state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
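
To make the frequency decomposition in the modality-dependent adapter more concrete, the following is a minimal NumPy sketch and not the PATrack implementation. Everything in it is an assumption made for illustration: the function names `split_frequencies` and `frequency_adapter`, the circular FFT mask with a `radius` cutoff, the scalar weights `alpha` and `beta`, and the bottleneck matrices `w_down` and `w_up`. The abstract only states that high- and low-frequency components are decomposed; how the released code does this may differ.

```python
import numpy as np

def split_frequencies(feat, radius):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    Assumption: a circular low-pass mask in the centred 2-D Fourier spectrum.
    """
    _, H, W = feat.shape
    # Centre the zero-frequency (DC) term at index (H // 2, W // 2).
    spec = np.fft.fftshift(np.fft.fft2(feat, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.ogrid[:H, :W]
    mask = (yy - H // 2) ** 2 + (xx - W // 2) ** 2 <= radius ** 2
    low_spec = spec * mask  # keep only frequencies inside the circle
    low = np.fft.ifft2(np.fft.ifftshift(low_spec, axes=(-2, -1)), axes=(-2, -1)).real
    high = feat - low  # the residual holds the remaining (high) frequencies
    return low, high

def frequency_adapter(feat, w_down, w_up, radius=2, alpha=1.0, beta=1.0):
    """Hypothetical adapter: reweight the two bands, then apply a residual
    channel bottleneck (C -> r -> C) as in common adapter designs."""
    low, high = split_frequencies(feat, radius)
    mixed = alpha * low + beta * high
    hidden = np.maximum(np.einsum("rc,chw->rhw", w_down, mixed), 0.0)  # ReLU
    delta = np.einsum("cr,rhw->chw", w_up, hidden)
    return feat + delta  # residual connection keeps the pre-trained feature

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))   # toy features for one modality
w_down = rng.standard_normal((2, 8)) * 0.1
w_up = np.zeros((8, 2))                   # zero init: adapter starts as identity
out = frequency_adapter(feat, w_down, w_up)
```

Initialising `w_up` to zero is a common adapter choice (again an assumption here): the adapted network initially reproduces the pre-trained RGB features, and the frequency-aware correction is learned gradually during fine-tuning.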
