All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with a unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding them into the unified backbone architecture. This approach achieves feature integration within the unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$, and WebUAV-3M, demonstrate the superiority of the proposed tracker over existing state-of-the-art VL trackers. Code will be made publicly available.
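
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that language injection is a simple broadcast addition of a pooled sentence vector to every vision token, and that the cross-modal objective is a symmetric InfoNCE loss over a batch of paired embeddings; the abstract specifies neither detail. All function names, tensor shapes, and the temperature value are hypothetical.

```python
import numpy as np


def inject_language(vision_tokens, language_vec):
    """Assumed mixing rule: add one pooled language vector (D,) to each vision token (N, D)."""
    return vision_tokens + language_vec[None, :]


def build_backbone_input(template_tokens, search_tokens, language_vec):
    """Concatenate language-injected template and search tokens into one sequence."""
    template = inject_language(template_tokens, language_vec)
    search = inject_language(search_tokens, language_vec)
    return np.concatenate([template, search], axis=0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_info_nce(vision_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of each (B, D) input is treated as a positive pair."""
    v = l2_normalize(vision_emb)
    t = l2_normalize(language_emb)
    logits = v @ t.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # vision -> language
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # language -> vision
    return 0.5 * (loss_v2t + loss_t2v)
```

Under these assumptions, `build_backbone_input` yields a single token sequence that a shared transformer could process without a separate fusion module, and `cross_modal_info_nce` is small when paired vision and language embeddings are similar and large when they are mismatched. An intra-modal term could plausibly reuse the same loss on two views drawn from one modality, although the abstract does not say how that objective is defined.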
