The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous work typically adds expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into an MOT model by employing an instance-level contrastive loss, a revised sampling strategy, and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities, and it adds little overhead. Our method surpasses the previous state of the art by 2.6 mMOTA on the challenging BDD100K dataset and performs comparably to existing transformer-based methods on the MOT17 dataset.
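To make the two central ingredients more concrete, the following is a minimal sketch of what an instance-level contrastive loss and a lightweight assignment step could look like. The abstract does not specify either formulation, so everything here is an assumption: the loss is a generic supervised-contrastive (InfoNCE-style) objective over per-object embeddings, the assignment is a simple greedy match on cosine similarity, and the names `instance_contrastive_loss`, `greedy_assign`, `temperature`, and `min_sim` are hypothetical. It should not be read as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """Generic supervised-contrastive loss (an assumed form, not the paper's).

    embeddings:   list of object embeddings pooled across frames
    instance_ids: identity label of each embedding; equal ids are positives
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        # Scaled similarities of anchor i against every other embedding.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        positives = [j for j in logits if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # an identity seen only once contributes no term
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def greedy_assign(track_embs, det_embs, min_sim=0.5):
    """Greedy one-to-one matching of detections to tracks (assumed scheme).

    Returns a dict mapping detection index -> track index. Pairs whose
    cosine similarity falls below min_sim are left unmatched.
    """
    pairs = sorted(
        ((cosine(t, d), ti, di)
         for ti, t in enumerate(track_embs)
         for di, d in enumerate(det_embs)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for sim, ti, di in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue
        matches[di] = ti
        used_tracks.add(ti)
        used_dets.add(di)
    return matches
```

In this sketch, embeddings of the same object in different frames are pulled together and embeddings of different objects are pushed apart, so at inference time a plain similarity lookup can link detections to existing tracks without an extra tracking module. A greedy match is used only to keep the example short; an optimal bipartite matching would be a natural alternative.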