Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search

Text attribute person search aims to retrieve specific pedestrians from given textual attributes, which is valuable in scenarios such as locating a designated pedestrian from a witness description. The key challenge is the large modality gap between textual attributes and images. Previous methods focus on explicit representation and alignment using unimodal pre-trained models. However, because these models lack inter-modality correspondence, local information within each modality may be distorted. Moreover, these methods consider only inter-modality alignment and ignore the differences between attribute categories. To mitigate these problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework that learns the correspondence of local representations between textual attributes and images and combines it with global representation matching to narrow the modality gap. First, we adopt the CLIP model as the backbone and design prompt templates that transform attribute combinations into structured sentences, which helps the model better understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts masked attributes after multi-modal interaction between image features and masked textual attribute features, thereby achieving implicit alignment of local relationships. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss that aligns the distribution of different textual attributes in the embedding space with their IoU distribution, yielding a better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that the proposed method significantly surpasses current state-of-the-art methods.
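As a rough illustration of two ingredients named above, the sketch below turns an attribute combination into a structured sentence and computes a pairwise attribute IoU that could serve as a soft target for an intra-modal contrastive loss. The template wording, the function names, and the reading of "attribute IoU" as the Jaccard index over attribute sets are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Set


def attributes_to_sentence(attrs: Dict[str, str]) -> str:
    """Render an attribute combination as one structured sentence.

    The template text is hypothetical; the abstract does not give the
    actual prompt wording.
    """
    parts = [f"{name} is {value}" for name, value in attrs.items()]
    return "A photo of a pedestrian whose " + ", ".join(parts) + "."


def attribute_iou(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two attribute sets (assumed meaning of attribute IoU)."""
    union = a | b
    if not union:
        # Two empty attribute sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def pairwise_attribute_iou(queries: List[Set[str]]) -> List[List[float]]:
    """IoU between every pair of attribute queries in a batch."""
    return [[attribute_iou(a, b) for b in queries] for a in queries]


if __name__ == "__main__":
    print(attributes_to_sentence({"gender": "female", "upper color": "red"}))
    batch = [
        {"female", "red top", "backpack"},
        {"female", "red top", "no bag"},
        {"male", "blue top", "no bag"},
    ]
    for row in pairwise_attribute_iou(batch):
        print(row)
```

Under this reading, queries that share more attributes receive a higher IoU, so a loss guided by the matrix would pull their text embeddings closer together than those of queries with little overlap.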
