In video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement RGB trackers. In practice, most existing RGB trackers learn a single set of parameters and use it across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of the inputs, each with a modality-specific representation, the scarcity of multi-modal datasets, and the fact that not all modalities are available at all times. In this work, we introduce Un-Track, a \underline{Un}ified Tracker with a single set of parameters for any modality. To handle any modality, our method learns a common latent space across modalities through low-rank factorization and reconstruction techniques. More importantly, we use only RGB-X pairs to learn this common latent space. The shared representation binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture and without modality-specific fine-tuning. Through a simple yet efficient prompting strategy, Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while adding only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M). Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific fine-tuned counterparts, validating its effectiveness and practicality.
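To make the idea of a low-rank shared latent space concrete, the following is a minimal sketch and not the Un-Track architecture itself. It assumes a single pair of low-rank maps (a down-projection to a small latent and an up-projection back), applied to an auxiliary-modality token and added to the RGB token as a prompt. The names `LowRankProjector` and `prompt_token`, the dimensions, and the random initialization are all hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankProjector:
    """Hypothetical shared projector: x -> up(down(x)), with rank << dim."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # down is (rank x dim), up is (dim x rank); random values, illustrative only.
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(dim)]
        self.dim = dim
        self.rank = rank

    def encode(self, x):
        """Map a dim-sized token to the rank-sized shared latent."""
        return matvec(self.down, x)

    def reconstruct(self, z):
        """Map a rank-sized latent back to a dim-sized token."""
        return matvec(self.up, z)

    def num_parameters(self):
        """Two factors of size dim * rank, versus dim * dim for a dense map."""
        return 2 * self.dim * self.rank


def prompt_token(projector, rgb, aux=None):
    """Add the low-rank reconstruction of an auxiliary token to the RGB token.

    If the auxiliary modality is missing (aux is None), the RGB token is
    returned unchanged, so the same parameters serve both cases.
    """
    if aux is None:
        return list(rgb)
    recon = projector.reconstruct(projector.encode(aux))
    return [r + p for r, p in zip(rgb, recon)]
```

In this sketch the two factors hold `2 * dim * rank` weights instead of the `dim * dim` a dense map would need, which is the general reason low-rank prompting adds little compute; how Un-Track actually factorizes and reconstructs its latent space is specified in the paper, not here.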