Single object tracking aims to locate one specific target in a video sequence, given its initial state. Classical trackers rely solely on visual cues, which limits their ability to handle challenges such as appearance variations, ambiguity, and distractions. Vision-Language (VL) tracking has therefore emerged as a promising approach: it incorporates language descriptions that directly provide high-level semantics and enhance tracking performance. However, current VL trackers have not fully exploited the power of VL learning. They rely heavily on off-the-shelf backbones for feature extraction, use ineffective VL fusion designs, and lack VL-related loss functions. To address these limitations, we present a novel tracker that progressively explores target-centric semantics for VL tracking. Specifically, we propose the first Synchronous Learning Backbone (SLB) for VL tracking, which consists of two novel modules: the Target Enhance Module (TEM) and the Semantic Aware Module (SAM). These modules enable the tracker to perceive target-related semantics and comprehend the context of both the visual and textual modalities at the same pace, facilitating VL feature extraction and fusion at different semantic levels. Moreover, we devise a dense matching loss to further strengthen multi-modal representation learning. Extensive experiments on VL tracking datasets demonstrate the superiority and effectiveness of our method.