Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval

Text-to-image person re-identification (ReID) aims to retrieve images of a person from a given textual description. The key challenge is to learn the relations between fine-grained details in the visual and textual modalities. Existing works focus on learning a latent space that narrows the modality gap and then build local correspondences between the two modalities. However, these methods assume that image-to-text and text-to-image associations are modality-agnostic, which results in suboptimal associations. In this work, we show the discrepancy between image-to-text association and text-to-image association and propose CADA (Cross-Modal Adaptive Dual Association), which builds fine-grained bidirectional image-text associations. Our approach features a decoder-based adaptive dual association module that enables full interaction between the visual and textual modalities, allowing bidirectional and adaptive cross-modal correspondences. Specifically, we propose a bidirectional association mechanism: Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA). We model the ATP adaptively, based on the fact that aggregating cross-modal features according to mistaken associations leads to feature distortion. To model the ARA, since attributes are typically the first distinguishing cues of a person, we explore the attribute-level association by predicting a masked text phrase from the related image region. Finally, we learn the dual associations between texts and images, and the experimental results demonstrate the superiority of our dual formulation. Code will be made publicly available.
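
The abstract does not specify how the ATP is made adaptive, so the following is only an illustrative sketch of the stated intuition: a text token should not aggregate image-patch features through associations that are likely mistaken. It assumes tokens and patches are already embedded in a shared latent space of equal dimension. The function name `associate_tokens_to_patches`, the cosine-similarity gate, and the `threshold` and `temperature` parameters are all hypothetical choices made for illustration and are not taken from CADA.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def associate_tokens_to_patches(tokens, patches, threshold=0.5, temperature=0.1):
    """Fuse each text token with the image patches it is confidently associated with.

    Assumption (not from the paper): an association counts as reliable when the
    cosine similarity reaches `threshold`. A token with no reliable patch is
    returned unchanged, so it is not distorted by a mistaken association.
    """
    fused = []
    for tok in tokens:
        sims = [cosine(tok, p) for p in patches]
        kept = [(s, p) for s, p in zip(sims, patches) if s >= threshold]
        if not kept:
            # No trustworthy association: skip aggregation for this token.
            fused.append(list(tok))
            continue
        # Numerically stable softmax over the retained similarities.
        top = max(s for s, _ in kept)
        weights = [math.exp((s - top) / temperature) for s, _ in kept]
        total = sum(weights)
        aggregated = [
            sum(w * p[d] for w, (_, p) in zip(weights, kept)) / total
            for d in range(len(tok))
        ]
        # Average the token with its aggregated visual evidence.
        fused.append([0.5 * (t + a) for t, a in zip(tok, aggregated)])
    return fused


tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [-1.0, 0.0]]
fused = associate_tokens_to_patches(tokens, patches)
```

In this toy input, the first token matches one patch and is fused with it, while the second token is orthogonal to both patches and is therefore left untouched. A learned gate or a decoder-based cross-attention layer could play the role of the fixed threshold used here.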
