Divert More Attention to Vision-Language Object Tracking

Multimodal vision-language (VL) learning has markedly advanced progress toward general-purpose intelligence, driven by emerging large foundation models. However, tracking, a fundamental vision problem, has benefited surprisingly little from the recent surge in VL learning. We argue that there are two reasons: the lack of large-scale videos with vision-language annotations, and the ineffective vision-language interaction learning in current works. These limitations motivate us to design a more effective vision-language representation for tracking and to construct a large database with language annotations for model learning. Specifically, in this paper, we first propose a general attribute annotation strategy to annotate videos in six popular tracking benchmarks, which yields a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework that improves tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve the VL representation, we introduce a contrastive loss to align the different modalities. To thoroughly demonstrate the effectiveness of our method, we integrate the proposed framework into three tracking methods with different designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the hybrid-structure TransT. Experiments demonstrate that our framework significantly improves all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show that it is well founded. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking with diversified multimodal information.

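The abstract mentions a contrastive loss that aligns the vision and language modalities but does not give its form. As a rough illustration only, the sketch below implements a symmetric InfoNCE-style loss, a common choice for cross-modal alignment; it is not confirmed to be the loss used in the paper. The function name `contrastive_alignment_loss`, the temperature value of 0.07, and the use of cosine similarity over pooled per-video features are all assumptions made for this example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def contrastive_alignment_loss(vision, language, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired vision/language features.

    vision[i] and language[i] are assumed to describe the same video;
    every other pairing in the batch is treated as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    v = [_normalize(x) for x in vision]
    t = [_normalize(x) for x in language]
    # Cosine similarity between every vision/language pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # language-to-vision direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    vis = [[1.0, 0.0], [0.0, 1.0]]
    aligned = contrastive_alignment_loss(vis, [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_alignment_loss(vis, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this toy setting, features whose matched pairs point in the same direction yield a loss near zero, while swapping the language features across videos yields a much larger loss, which is the behaviour an alignment objective of this kind is meant to reward.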