In person re-identification (re-ID), extracting part-level features from person images has proven crucial for providing fine-grained information. Most existing CNN-based methods either locate human parts only coarsely or rely on pretrained human parsing models, and thus fail to locate identifiable nonhuman parts (e.g., a knapsack). In this article, we introduce an alignment scheme into the transformer architecture for the first time and propose the auto-aligned transformer (AAformer) to automatically locate both human and nonhuman parts at the patch level. We introduce "Part tokens ([PART]s)", which are learnable vectors, to extract part features in the transformer. A [PART] interacts only with a local subset of patches in self-attention and learns to become the part representation. To adaptively group the image patches into different subsets, we design the auto-alignment, which employs a fast variant of the optimal transport (OT) algorithm to cluster the patch embeddings online into several groups with the [PART]s as their prototypes. AAformer integrates part alignment into self-attention, and the output [PART]s can be used directly as part features for retrieval. Extensive experiments validate the effectiveness of [PART]s and the superiority of AAformer over various state-of-the-art methods.
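The auto-alignment step described above can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified stand-in (not the authors' implementation): it softly assigns patch embeddings to [PART] prototypes via a few Sinkhorn iterations, a standard fast entropy-regularized OT solver, and then derives a hard grouping. The function name, array shapes, and hyperparameters (`eps`, `n_iters`) are illustrative assumptions.

```python
import numpy as np

def sinkhorn_assign(patches, prototypes, eps=0.05, n_iters=3):
    """Illustrative sketch (not the paper's code): cluster patch
    embeddings into groups with [PART] prototypes via Sinkhorn
    iterations, an entropy-regularized optimal-transport solver.

    patches:    (n_patches, dim) patch embeddings
    prototypes: (n_parts, dim)   [PART] token embeddings
    returns:    (n_patches,)     index of the part each patch joins
    """
    # Patch-to-prototype similarity, scaled by the entropy weight.
    scores = patches @ prototypes.T            # (n_patches, n_parts)
    scores -= scores.max()                     # numerical stability
    Q = np.exp(scores / eps)                   # Gibbs kernel
    Q /= Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        # Alternate row/column normalization: each patch distributes
        # unit mass; each part receives an equal share of total mass.
        Q /= Q.sum(axis=1, keepdims=True) * n
        Q /= Q.sum(axis=0, keepdims=True) * k
    # Hard grouping: each patch joins the part with the most mass.
    return Q.argmax(axis=1)
```

In the actual AAformer this assignment is computed online during self-attention, so each [PART] attends only to the patches assigned to it.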