In recent years, the dominant multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) methods have mainly followed the tracking-by-detection paradigm. Transformer-based end-to-end (E2E) solutions have brought new ideas to MOT and MOTS, but they have not yet achieved state-of-the-art (SOTA) performance on the major MOT and MOTS benchmarks. Detection and association are the two main modules of the tracking-by-detection paradigm, and association techniques mainly rely on a combination of motion and appearance information. With recent advances in deep learning, the performance of detection and appearance models has improved rapidly. This trend led us to ask whether SOTA performance can be achieved using only a high-performance detector and a high-performance appearance model. Our paper explores this direction, using CBNetV2 with a Swin-B backbone as the detection model and MoCo-v2 as a self-supervised appearance model. Motion information and IoU mapping are removed from the association step. Our method won first place on the MOTS track and second place on the MOT track at the CVPR2023 WAD workshop. We hope our simple and effective method can offer some insights to the MOT and MOTS research community. Source code will be released under this git repository
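To make the idea of appearance-only association concrete, the following is a minimal sketch, not the authors' implementation. It assumes each existing track and each new detection is already represented by an embedding vector (for example, one produced by an appearance model), and it matches them by cosine similarity alone, with no motion model and no IoU term. The function names, the greedy matching strategy, and the `min_similarity` threshold are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate_by_appearance(track_embeddings, detection_embeddings, min_similarity=0.5):
    """Greedily match detections to tracks using appearance similarity only.

    track_embeddings: dict mapping track_id -> embedding vector (assumed input).
    detection_embeddings: list of embedding vectors for the current frame.
    min_similarity: hypothetical threshold below which a pair is never matched.

    Returns (matches, unmatched), where matches maps detection index -> track_id
    and unmatched lists detection indices that could start new tracks.
    """
    # Score every track/detection pair; no motion or IoU cue is used.
    candidates = []
    for track_id, track_vec in track_embeddings.items():
        for det_idx, det_vec in enumerate(detection_embeddings):
            sim = cosine_similarity(track_vec, det_vec)
            if sim >= min_similarity:
                candidates.append((sim, track_id, det_idx))

    # Take the most similar pairs first, using each track and detection once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    matches = {}
    used_tracks = set()
    used_detections = set()
    for sim, track_id, det_idx in candidates:
        if track_id in used_tracks or det_idx in used_detections:
            continue
        matches[det_idx] = track_id
        used_tracks.add(track_id)
        used_detections.add(det_idx)

    unmatched = [i for i in range(len(detection_embeddings)) if i not in used_detections]
    return matches, unmatched


# Toy example with 2-D embeddings (real appearance embeddings are much larger).
tracks = {1: [1.0, 0.0], 2: [0.0, 1.0]}
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, unmatched = associate_by_appearance(tracks, detections)
```

In this toy example, detection 0 is assigned to track 2, detection 1 to track 1, and detection 2 is left unmatched. The greedy step is used only to keep the sketch dependency-free; a globally optimal assignment such as the Hungarian algorithm could be substituted.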