The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications such as autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects that belong to pre-defined, closed-set categories. Recently, Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track objects of interest beyond pre-defined categories, given a text prompt or a template image, respectively. However, training OVMOT models requires expensive pre-trained (vision-)language models and fine-grained category annotations. In this paper, we focus on GMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Existing GMOT methods train a Single Object Tracking (SOT) based detector to detect objects of interest and then apply a data-association-based MOT tracker to obtain the trajectories; in contrast, we leverage the inherent object queries in DETR variants. Specifically: 1) multi-scale object queries are designed based on the given template image, which are effective for detecting objects of different scales that share the category of the template image; 2) a dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, taking full advantage of the provided annotations; 3) the online tracking pipeline is simplified in a tracking-by-query manner by incorporating the tracked boxes from the previous frame as additional query boxes, so that the complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin.
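As an illustration of point 3, the following is a minimal sketch of how NMS might stand in for data association in a tracking-by-query pipeline. It is not the authors' implementation. It assumes that each prediction produced from a previous-frame query box inherits that box's track ID, while predictions from template-based queries carry no ID. The names `Prediction` and `associate_with_nms`, the thresholds, and the rule that tracked-query predictions take priority over template-query predictions are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Prediction:
    box: Box
    score: float
    # Assumption: set when the query came from a tracked box in the
    # previous frame, None when it came from a template-based query.
    track_id: Optional[int]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate_with_nms(
    preds: List[Prediction],
    next_id: int,
    iou_thr: float = 0.5,    # hypothetical threshold
    score_thr: float = 0.3,  # hypothetical threshold
) -> Tuple[List[Prediction], int]:
    """Greedy NMS in which tracked-query predictions are visited first.

    A template-query prediction that overlaps an already kept box is
    treated as a duplicate and dropped; one that survives starts a new
    track. Returns the kept predictions and the next unused track ID.
    """
    candidates = [p for p in preds if p.score >= score_thr]
    # Tracked queries first (assumption), then by descending score.
    candidates.sort(key=lambda p: (p.track_id is None, -p.score))

    kept: List[Prediction] = []
    for p in candidates:
        if all(iou(p.box, k.box) < iou_thr for k in kept):
            if p.track_id is None:
                p = Prediction(p.box, p.score, next_id)
                next_id += 1
            kept.append(p)
    return kept, next_id


if __name__ == "__main__":
    frame_preds = [
        Prediction((0, 0, 10, 10), 0.90, 7),         # continues track 7
        Prediction((1, 1, 10, 10), 0.95, None),      # duplicate of track 7
        Prediction((50, 50, 60, 60), 0.80, None),    # new object
        Prediction((100, 100, 110, 110), 0.10, None) # low confidence
    ]
    tracks, next_free_id = associate_with_nms(frame_preds, next_id=8)
    for t in tracks:
        print(t.track_id, t.box, t.score)
```

In this sketch, identity is carried by the query itself, so no explicit matching step between detections and existing tracks is needed. NMS only removes duplicate boxes and decides which unmatched predictions start new tracks.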