Pre-training has emerged as an effective technique for learning powerful person representations. Most existing methods pre-train on large-scale pure-vision datasets such as ImageNet and LUPerson and achieve remarkable performance. However, because they rely solely on visual information, these methods lack robust explicit indicators, which makes it difficult for them to learn discriminative person representations. Drawing inspiration from the intrinsic fine-grained attribute indicators in person descriptions, we explore introducing the language modality into person representation learning. To this end, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. To explicitly build fine-grained cross-modal associations, we design three pretext tasks, \ie semantic-fused image colorization, visual-fused attributes prediction, and vision-language matching. In addition, due to the lack of an appropriate dataset, we present a large-scale person dataset named SYNTH-PEDES, for which we propose the Stylish Pedestrian Attributes-union Captioning method to synthesize diverse textual descriptions. We pre-train PLIP on SYNTH-PEDES and evaluate it on a range of downstream tasks, including text-based Re-ID, image-based Re-ID, and person attribute recognition. Extensive experiments demonstrate that our model not only significantly improves existing methods on all these tasks, but also shows strong ability in the few-shot and domain generalization settings. The code, dataset and weights will be released at~\url{https://github.com/Zplusdragon/PLIP}.