Siamese-DETR for Generic Multi-Object Tracking

The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications such as autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to objects that belong to pre-defined, closed-set categories. Recently, Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track objects of interest beyond pre-defined categories, given a text prompt and a template image, respectively. However, training OVMOT models requires an expensive, well-pretrained (vision-)language model and fine-grained category annotations. In this paper, we focus on GMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Existing GMOT methods train a Single Object Tracking (SOT) based detector to detect the objects of interest and then apply a data-association-based MOT tracker to obtain the trajectories; in contrast, we leverage the object queries inherent in DETR variants. Specifically: 1) Multi-scale object queries are designed based on the given template image, and are effective for detecting objects of different scales that share the template's category; 2) A dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, taking full advantage of the provided annotations; 3) The online tracking pipeline is simplified in a tracking-by-query manner by incorporating the boxes tracked in the previous frame as additional query boxes, so that complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin.

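To make point 3 concrete, the following is a minimal Python sketch of how NMS could stand in for data association in a tracking-by-query setup. It is an illustration under stated assumptions, not the authors' implementation: the function names (`iou`, `associate_by_nms`), the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the rule that boxes decoded from previous-frame query boxes keep their track ID and take priority over template-query detections are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate_by_nms(
    track_boxes: Dict[int, Box],
    new_boxes: List[Box],
    next_id: int,
    iou_thr: float = 0.5,  # assumed threshold, not taken from the paper
) -> Tuple[Dict[int, Box], int]:
    """Merge current-frame outputs with an NMS-style rule.

    track_boxes: boxes predicted from previous-frame query boxes, keyed by
                 the track ID they inherit (assumption: these have priority).
    new_boxes:   boxes predicted from template queries, assumed to be sorted
                 by descending confidence.
    A new box is suppressed if it overlaps any kept box above iou_thr;
    otherwise it starts a new track.
    """
    tracks = dict(track_boxes)
    for box in new_boxes:
        if all(iou(box, kept) <= iou_thr for kept in tracks.values()):
            tracks[next_id] = box
            next_id += 1
    return tracks, next_id


# Hypothetical usage: one existing track, one duplicate, one new object.
prev_tracks = {1: (0.0, 0.0, 10.0, 10.0)}
detections = [(1.0, 1.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
tracks, next_id = associate_by_nms(prev_tracks, detections, next_id=2)
```

In this sketch the first detection overlaps track 1 and is suppressed, while the second does not overlap anything and opens track 2. Identity is carried by the query that produced a box, so no explicit matching step (such as Hungarian assignment between frames) is needed at inference time.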