Most existing multi-object tracking methods learn visual tracking features by maximizing the dis-similarity between different instances and maximizing the similarity of the same instance across frames. While such a feature learning scheme achieves promising performance, learning discriminative features solely from visual information is challenging, especially in the presence of environmental interference such as occlusion, blur, and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby improving robustness to such interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are used to guide the visual features during training. At inference, our LG-MOT relies on the standard visual features alone, without requiring annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions, leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2\% in target object association (IDF1 score) over the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at ~\url{https://github.com/WesLee88524/LG-MOT}.
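To make the training-time language guidance concrete, the following is a minimal sketch of one plausible alignment objective; the symbols $f_v^{\text{ins}}$, $f_v^{\text{sce}}$, $e_t^{\text{ins}}$, $e_t^{\text{sce}}$, $\lambda_1$ and $\lambda_2$ are illustrative assumptions rather than the exact formulation used in LG-MOT. With $f_v^{\text{ins}}$ and $f_v^{\text{sce}}$ denoting instance- and scene-level visual features, and $e_t^{\text{ins}}$, $e_t^{\text{sce}}$ the corresponding (frozen) language embeddings, an auxiliary alignment term
\[
\mathcal{L}_{\text{align}} \;=\; \lambda_1 \Big( 1 - \cos\big(f_v^{\text{ins}}, e_t^{\text{ins}}\big) \Big) \;+\; \lambda_2 \Big( 1 - \cos\big(f_v^{\text{sce}}, e_t^{\text{sce}}\big) \Big)
\]
could be added to the standard tracking losses during training and simply dropped at inference, consistent with the fact that no language annotations are required at test time.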