Existing end-to-end Multi-Object Tracking (e2e-MOT) methods have not surpassed non-end-to-end tracking-by-detection methods. One potential reason is their label assignment strategy during training, which consistently binds tracked objects to tracking queries and assigns only the few newborn objects to detection queries. With one-to-one bipartite matching, this assignment yields unbalanced training, i.e., scarce positive samples for detection queries, especially in enclosed scenes, where the majority of newborns appear at the beginning of the video. As a result, e2e-MOT is more prone than tracking-by-detection methods to terminating a track without renewal or re-initialization. To alleviate this problem, we present Co-MOT, a simple and effective method that facilitates e2e-MOT through a novel coopetition label assignment with a shadow concept. Specifically, we add tracked objects to the matching targets for detection queries when performing label assignment for training the intermediate decoders. For query initialization, we expand each query with a set of shadow counterparts with limited disturbance to itself. With extensive ablations, Co-MOT achieves superior performance without extra costs, e.g., 69.4% HOTA on DanceTrack and 52.8% TETA on BDD100K. Impressively, Co-MOT requires only 38% of the FLOPs of MOTRv2 to attain similar performance, resulting in 1.4× faster inference.
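The label assignment idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 1-D object positions, and the absolute-distance matching cost are all hypothetical simplifications, and a brute-force search stands in for the Hungarian algorithm that a real one-to-one bipartite matcher would use. The sketch only shows how adding tracked objects to the target set at intermediate decoder layers gives detection queries more positive samples.

```python
from itertools import permutations


def match_one_to_one(cost):
    """Brute-force minimum-cost one-to-one matching (toy stand-in for Hungarian).

    cost[q][t] is the cost of assigning query q to target t.
    Assumes there are at least as many queries as targets.
    Returns {query_index: target_index}.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    assert num_queries >= num_targets
    best, best_cost = (), float("inf")
    # perm[t] is the query assigned to target t.
    for perm in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return {q: t for t, q in enumerate(best)}


def detection_targets(objects, intermediate_layer):
    """Pick the ground-truth objects that detection queries may match.

    Assumed behavior: at intermediate decoder layers, tracked objects are
    added to the targets; otherwise only newborn objects are used.
    """
    if intermediate_layer:
        return list(objects)
    return [o for o in objects if not o["tracked"]]


def assign_detection_labels(query_preds, objects, intermediate_layer):
    """Return {detection_query_index: object_id} for the matched queries."""
    targets = detection_targets(objects, intermediate_layer)
    # Hypothetical cost: absolute distance between 1-D positions.
    cost = [[abs(p - t["pos"]) for t in targets] for p in query_preds]
    matches = match_one_to_one(cost)
    return {q: targets[t]["id"] for q, t in matches.items()}


# Toy frame: two objects already tracked, one newborn.
objects = [
    {"id": 1, "pos": 0.1, "tracked": True},
    {"id": 2, "pos": 0.5, "tracked": True},
    {"id": 3, "pos": 0.9, "tracked": False},
]
det_query_preds = [0.12, 0.48, 0.88, 0.30]

# Newborn-only targets: a single positive detection query.
print(assign_detection_labels(det_query_preds, objects, False))  # {2: 3}
# Tracked objects added as targets: three positive detection queries.
print(assign_detection_labels(det_query_preds, objects, True))   # {0: 1, 1: 2, 2: 3}
```

In this toy frame the number of positive detection queries rises from one to three once tracked objects are included as targets, which is the imbalance the coopetition assignment is meant to reduce.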