POAR: Towards Open Vocabulary Pedestrian Attribute Recognition

Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian in a surveillance system. Existing methods address PAR by training a multi-label classifier over predefined attribute classes. However, it is impossible to enumerate every pedestrian attribute that occurs in the real world. To tackle this problem, we develop a novel pedestrian open-attribute recognition (POAR) framework. Our key idea is to formulate POAR as an image-text search problem. We design a Transformer-based image encoder with a masking strategy. A set of attribute tokens is introduced to focus on specific pedestrian parts (e.g., head, upper body, lower body, and feet) and to encode the corresponding attributes into visual embeddings. Each attribute category is described by a natural language sentence and encoded by the text encoder. We then compute the similarity between the visual and text embeddings of attributes to find the best attribute descriptions for the input images. Unlike existing methods, which learn a specific classifier for each attribute category, we model the pedestrian at the part level and use this search formulation to handle unseen attributes. Finally, because a pedestrian image can exhibit multiple attributes, we propose a many-to-many contrastive (MTMC) loss with masked tokens to train the network. Extensive experiments on benchmark PAR datasets under an open-attribute setting verify the effectiveness of the proposed POAR method, which can serve as a strong baseline for the POAR task. Our code is available at \url{https://github.com/IvyYZ/POAR}.
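
The search formulation described above can be illustrated with a minimal sketch. It is not the implementation in the released code: the function names `rank_attributes` and `multi_positive_contrastive_loss`, the use of cosine similarity, the temperature value, and the exact form of the loss are all assumptions made for illustration. The sketch takes one visual embedding per part token and one text embedding per attribute sentence, ranks attribute descriptions by similarity, and computes a generic contrastive loss that allows several positive attributes per visual embedding and skips rows that have none.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def rank_attributes(visual_emb, text_emb, top_k=1):
    """Rank attribute sentences for each part-level visual embedding.

    visual_emb: (P, D) array, one embedding per attribute/part token (assumed layout).
    text_emb:   (A, D) array, one embedding per attribute description.
    Returns the (P, top_k) indices of the most similar descriptions and the
    full (P, A) similarity matrix.
    """
    sim = l2_normalize(visual_emb) @ l2_normalize(text_emb).T
    top = np.argsort(-sim, axis=1)[:, :top_k]
    return top, sim


def multi_positive_contrastive_loss(sim, positives, temperature=0.07):
    """Generic contrastive loss with several positives per row (an assumed form).

    sim:       (N, A) similarity matrix.
    positives: (N, A) boolean mask; True marks attributes present in the image.
    Rows without any positive are masked out of the average.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    per_row = -(log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(per_row.mean())


if __name__ == "__main__":
    # Toy data: 4 attribute sentences and 2 part tokens in a 4-d embedding space.
    text = np.eye(4)
    vis = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.2, 1.0]])
    top, sim = rank_attributes(vis, text, top_k=1)
    labels = np.array([[1, 0, 0, 0],
                       [0, 0, 0, 1]], dtype=bool)
    print(top.ravel(), multi_positive_contrastive_loss(sim, labels))
```

Because recognition reduces to ranking text embeddings, an attribute unseen during training can in principle be queried by encoding a new sentence and appending it to `text_emb`, with no new classifier head required.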
