Pre-training plays an increasingly important role in learning generic feature representations for person re-identification (ReID). We argue that a high-quality ReID representation should have three properties: multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose PersonMAE, a simple yet effective pre-training framework that incorporates two core designs into masked autoencoders to better serve ReID. 1) PersonMAE generates two regions from a given image, with RegionA as the input and RegionB as the prediction target. RegionA is corrupted with block-wise masking to mimic the occlusion common in ReID, and its remaining visible parts are fed into the encoder. 2) PersonMAE then predicts the whole of RegionB at both the pixel level and the semantic feature level. Together, these designs encourage the pre-trained feature representations to exhibit the three properties above. These properties make PersonMAE well suited to downstream ReID, leading to state-of-the-art performance on four downstream ReID tasks: supervised (holistic and occluded settings) and unsupervised (UDA and USL settings). Notably, in the commonly adopted supervised setting, PersonMAE with a ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous state of the art by large margins of +8.0 mAP and +5.3 mAP, respectively.
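The input side of the first design (two regions from one image, block-wise masking of RegionA, visible patches passed on to the encoder) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the regions are sampled or how the blocks are drawn, so the helper names (`crop_two_regions`, `blockwise_mask`, `visible_patches`), the crop size, the patch size, the mask ratio, and the `max_block` limit are all assumptions chosen for illustration. The encoder, decoder, and the pixel-level and feature-level prediction losses are omitted.

```python
import numpy as np


def crop_two_regions(image, crop_h, crop_w, rng):
    """Sample two same-sized crops (they may overlap) from one image.

    Assumption: independent uniform crop positions; the abstract does not
    specify the sampling scheme.
    """
    height, width = image.shape[:2]
    regions = []
    for _ in range(2):
        top = rng.integers(0, height - crop_h + 1)
        left = rng.integers(0, width - crop_w + 1)
        regions.append(image[top:top + crop_h, left:left + crop_w])
    return regions[0], regions[1]


def blockwise_mask(grid_h, grid_w, mask_ratio, rng, max_block=4):
    """Return a boolean patch mask (True = masked) built from rectangles.

    Rectangles are added until at least mask_ratio of the patches are
    masked, so the final ratio can slightly exceed the target.
    """
    target = int(round(mask_ratio * grid_h * grid_w))
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    while mask.sum() < target:
        block_h = rng.integers(1, min(max_block, grid_h) + 1)
        block_w = rng.integers(1, min(max_block, grid_w) + 1)
        top = rng.integers(0, grid_h - block_h + 1)
        left = rng.integers(0, grid_w - block_w + 1)
        mask[top:top + block_h, left:left + block_w] = True
    return mask


def visible_patches(region, patch, mask):
    """Split a region into flattened patches and keep the unmasked ones."""
    grid_h, grid_w = mask.shape
    channels = region.shape[2]
    patches = region.reshape(grid_h, patch, grid_w, patch, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(grid_h * grid_w, patch * patch * channels)
    return patches[~mask.reshape(-1)]


# Hypothetical sizes: a 256x128 pedestrian image, 128x64 crops, 16x16 patches.
rng = np.random.default_rng(0)
image = rng.random((256, 128, 3))
patch = 16

region_a, region_b = crop_two_regions(image, 128, 64, rng)
mask = blockwise_mask(128 // patch, 64 // patch, mask_ratio=0.5, rng=rng)
encoder_input = visible_patches(region_a, patch, mask)  # visible parts of RegionA
target_region = region_b                                # full RegionB is the target
```

Under these assumptions, `encoder_input` holds one flattened row per visible patch of RegionA, while `target_region` is left unmasked because the model is asked to predict the whole of RegionB.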