Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained CNN (e.g., ResNet) as the backbone network for visual feature learning, which may yield sub-optimal results because the relations between pedestrian images and attribute labels are under-exploited. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit these relations. Specifically, the attribute phrases are first expanded into sentences, and the pre-trained vision-language model CLIP is then adopted as our backbone for embedding both the images and the attribute descriptions. The contrastive learning objective aligns the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture long-range relations between pixels. A multi-modal Transformer then fuses the two sets of features, and a feed-forward network predicts the attributes. To optimize our network efficiently, we propose a region-aware prompt tuning technique that adjusts very few parameters (i.e., only the prompt vectors and classification heads) and keeps both the pre-trained VL model and the multi-modal Transformer fixed. Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters relative to the fine-tuning strategy. It also achieves new state-of-the-art performance in both the standard and zero-shot settings for PAR, including on the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
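Two steps of the abstract can be illustrated with a minimal Python sketch: expanding attribute phrases into sentences, and computing the share of parameters that are tuned when only the prompt vectors and classification heads are trainable. This is not the authors' implementation. The sentence template, the `ParamGroup` type, the function names, and all parameter counts below are hypothetical and chosen only for illustration; they are not taken from the paper or from the OpenPAR repository.

```python
from dataclasses import dataclass

# Hypothetical template: the abstract does not state the wording it uses.
TEMPLATE = "A photo of a pedestrian with the attribute: {}."


def expand_attributes(phrases, template=TEMPLATE):
    """Turn short attribute phrases into full sentences for a text encoder."""
    return [template.format(p.strip().lower()) for p in phrases]


@dataclass
class ParamGroup:
    """A named block of parameters and whether it is updated during training."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(groups):
    """Return the share of all parameters that belong to trainable groups."""
    total = sum(g.num_params for g in groups)
    if total == 0:
        raise ValueError("no parameters to count")
    tuned = sum(g.num_params for g in groups if g.trainable)
    return tuned / total


sentences = expand_attributes(["Long Hair", "Backpack"])

# Toy counts, NOT the real model sizes: frozen backbone and fusion module,
# trainable prompt vectors and classification heads.
groups = [
    ParamGroup("vl_backbone", 900, trainable=False),
    ParamGroup("multimodal_transformer", 90, trainable=False),
    ParamGroup("prompt_vectors", 6, trainable=True),
    ParamGroup("classification_heads", 4, trainable=True),
]
fraction = trainable_fraction(groups)
```

With real parameter counts substituted for the toy values, the same ratio is the kind of quantity the 0.75% figure describes, under the assumption that it is measured against all parameters a full fine-tuning run would update.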