Discriminative representation is essential for keeping a unique identifier for each target in multiple object tracking (MOT). Some recent MOT methods extract features of the bounding box region or the center point as identity embeddings. However, when targets are occluded, these coarse-grained global representations become unreliable. To address this, we propose exploring diverse fine-grained representation, which describes appearance comprehensively from both global and local perspectives. This fine-grained representation requires high feature resolution and precise semantic information. To alleviate the semantic misalignment caused by indiscriminate aggregation of contextual information, we propose the Flow Alignment FPN (FAFPN) for aligned multi-scale feature aggregation. It generates semantic flow between feature maps of different resolutions and uses that flow to transform their pixel positions. Furthermore, we present a Multi-head Part Mask Generator (MPMG) that extracts fine-grained representation from the aligned feature maps. The multiple parallel branches of MPMG allow it to focus on different parts of a target and generate local masks without label supervision. The diverse details in the target masks facilitate fine-grained representation. Finally, benefiting from a Shuffle-Group Sampling (SGS) training strategy that balances positive and negative samples, we achieve state-of-the-art performance on the MOT17 and MOT20 test sets. Even on DanceTrack, where the appearance of targets is extremely similar, our method significantly outperforms ByteTrack by 5.0% on HOTA and 5.6% on IDF1. Extensive experiments demonstrate that diverse fine-grained representation makes Re-ID great again in MOT.
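The flow-based alignment described in the abstract can be illustrated with a minimal sketch. This is not the FAFPN implementation: it only shows the general idea of resampling a coarse feature map at positions shifted by a per-pixel flow before adding it to a finer map. The function names `warp_bilinear` and `align_and_fuse`, the list-of-lists feature format, and the `(dx, dy)` flow layout are all assumptions made for illustration. In a real network the flow would be predicted by learned layers, whereas here it is supplied by hand.

```python
import math


def warp_bilinear(feature, flow):
    """Resample a 2D feature map at positions shifted by a per-pixel flow.

    feature: H x W list of numbers.
    flow:    H x W list of (dx, dy) pixel offsets (hypothetical layout).
    Output pixel (x, y) takes the bilinearly interpolated value of
    `feature` at (x + dx, y + dy); sample positions are clamped to the border.
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling position so it stays inside the map.
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0  # fractional parts used as weights
            top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
            bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


def align_and_fuse(high_res, low_res_upsampled, flow):
    """Warp the upsampled coarse map with the flow, then add it to the fine map."""
    warped = warp_bilinear(low_res_upsampled, flow)
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(high_res, warped)]
```

With an all-zero flow, `warp_bilinear` returns the input values unchanged, so `align_and_fuse` reduces to the plain element-wise sum used in a standard FPN. A non-zero flow lets each output position pull its value from a shifted location in the coarse map, which is the mechanism the abstract refers to as transforming pixel positions.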