Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation

The modality gap between RGB and thermal infrared (TIR) images is a crucial but often overlooked issue in existing RGBT tracking methods. It can be observed that the modality gap lies mainly in the difference in image style. In this work, we propose a novel Coupled Knowledge Distillation framework, called CKD, which pursues common styles across modalities to break the modality gap for high-performance RGBT tracking. In particular, we introduce two student networks and employ a style distillation loss to make their style features as consistent as possible. By alleviating the style difference between the two student networks, we can effectively break the modality gap. However, distilling style features might harm the content representations of the two modalities in the student networks. To handle this issue, we take the original RGB and TIR networks as teachers and distill their content knowledge into the two student networks, respectively, through a style-content orthogonal feature decoupling scheme. We couple these two distillation processes in an online optimization framework to form new feature representations of the RGB and thermal modalities without the modality gap. In addition, we incorporate a masked modeling strategy and a multi-modal candidate token elimination strategy into CKD to improve tracking robustness and efficiency, respectively. Extensive experiments on five standard RGBT tracking datasets validate the effectiveness of the proposed method against state-of-the-art methods while achieving the fastest tracking speed of 96.4 FPS. Code is available at https://github.com/Multi-Modality-Tracking/CKD.
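The coupling of the two distillation processes described above can be illustrated with a small sketch. It is not the released CKD implementation: the abstract does not define the style features or the decoupling scheme, so the sketch assumes that style is the channel-wise mean and standard deviation over tokens and that content is what remains after normalizing those statistics away. All function names and the two loss weights are hypothetical.

```python
import math


def style_stats(tokens, eps=1e-5):
    """Channel-wise mean and std over tokens (ASSUMED style descriptor)."""
    n = len(tokens)
    channels = len(tokens[0])
    means = [sum(t[c] for t in tokens) / n for c in range(channels)]
    stds = [
        math.sqrt(sum((t[c] - means[c]) ** 2 for t in tokens) / n + eps)
        for c in range(channels)
    ]
    return means, stds


def content_features(tokens, eps=1e-5):
    """Tokens with the style statistics normalized away (ASSUMED content)."""
    means, stds = style_stats(tokens, eps)
    return [[(t[c] - means[c]) / stds[c] for c in range(len(t))] for t in tokens]


def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def style_loss(tokens_a, tokens_b):
    """Pull the style statistics of the two student networks together."""
    mean_a, std_a = style_stats(tokens_a)
    mean_b, std_b = style_stats(tokens_b)
    return mse(mean_a, mean_b) + mse(std_a, std_b)


def content_loss(student_tokens, teacher_tokens):
    """Keep a student's content close to that of its own-modality teacher."""
    s = [v for row in content_features(student_tokens) for v in row]
    t = [v for row in content_features(teacher_tokens) for v in row]
    return mse(s, t)


def coupled_loss(student_rgb, student_tir, teacher_rgb, teacher_tir,
                 style_weight=1.0, content_weight=1.0):
    """Sum of student-to-student style loss and teacher-to-student content losses.

    The weights are hypothetical; the abstract does not state how the terms
    are balanced.
    """
    style = style_loss(student_rgb, student_tir)
    content = (content_loss(student_rgb, teacher_rgb)
               + content_loss(student_tir, teacher_tir))
    return style_weight * style + content_weight * content


# Toy features: 3 tokens x 2 channels. The TIR teacher has the same content
# as the RGB teacher but a different style (scaled and shifted channels).
teacher_rgb = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
teacher_tir = [[2.0 * v + 3.0 for v in row] for row in teacher_rgb]
```

Under these assumptions, the toy teachers differ only in style, so two students that share one style can still match the content of both teachers and drive the coupled loss close to zero. This is the intuition behind distilling style between students and content from teachers at the same time.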
