Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we find that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior, human parts, into pre-training. Specifically, we employ this prior to guide the mask sampling process: image patches corresponding to human part regions are given high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views of the same image, each guided by the human part prior, to be closely aligned. We term the entire method HAP. HAP simply uses a plain ViT as the encoder, yet it establishes new state-of-the-art performance on 11 human-centric benchmarks and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.
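To make the two ideas above concrete, the following is a minimal sketch, not the HAP implementation. It assumes that the set of patch indices overlapping human parts is already available (the hypothetical `part_patches` argument), that priority is realized by adding a bonus to a random per-patch score, and that alignment is measured as one minus cosine similarity between the features of two masked views. The names `part_guided_mask`, `part_bonus`, and `alignment_loss` are illustrative and do not come from the paper.

```python
import math
import random


def part_guided_mask(num_patches, part_patches, mask_ratio=0.5,
                     part_bonus=1.0, rng=None):
    """Return a sorted list of patch indices to mask.

    Assumption: `part_patches` is a set of patch indices that overlap
    human part regions. Each patch gets a uniform random score in [0, 1);
    part patches get `part_bonus` added, so with the default bonus of 1.0
    they are always selected before any non-part patch.
    """
    rng = rng or random.Random()
    num_masked = int(round(num_patches * mask_ratio))
    scored = []
    for idx in range(num_patches):
        bonus = part_bonus if idx in part_patches else 0.0
        scored.append((rng.random() + bonus, idx))
    # Highest scores are masked first.
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:num_masked])


def alignment_loss(feat_a, feat_b):
    """Assumed loss form: 1 - cosine similarity of two view features."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    rng = random.Random(0)
    parts = {3, 4, 5, 10}  # hypothetical part-region patches
    view_1 = part_guided_mask(16, parts, rng=rng)
    view_2 = part_guided_mask(16, parts, rng=rng)
    print("masked patches, view 1:", view_1)
    print("masked patches, view 2:", view_2)
    print("loss for identical directions:",
          alignment_loss([1.0, 2.0], [2.0, 4.0]))
```

Calling `part_guided_mask` twice on the same image yields two different masked views that both cover the part patches. In a training loop, the encoder features of the two views would be passed to a loss such as `alignment_loss`. A `part_bonus` below 1.0 turns the hard priority into a soft preference.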