The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, yet the intricate semantic correspondences between texts and images remain insufficiently modeled. To address this issue, we propose a novel TIPR framework that builds fine-grained interactions and alignment between person images and their corresponding texts. Specifically, we first construct a visual-textual dual encoder by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model to preliminarily align image and text features. Second, we propose a Text-guided Image Restoration (TIR) auxiliary task that maps abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. Additionally, a cross-modal triplet loss is introduced to handle hard samples and further enhance the model's ability to discriminate minor differences. Moreover, a pruning-based text data augmentation approach is proposed to focus attention on the essential elements of descriptions, preventing the model from over-attending to less significant information. Experimental results show that our method outperforms state-of-the-art methods on three popular benchmark datasets, and the code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
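The cross-modal triplet loss with hard-sample handling mentioned above can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact formulation: the batch-hard mining strategy, the margin value, and the function name are assumptions for demonstration.

```python
import numpy as np

def cross_modal_triplet_loss(img, txt, labels, margin=0.2):
    """Illustrative cross-modal triplet loss with batch-hard negative mining.

    img, txt  : L2-normalized embeddings, shape (N, D); row i of img matches
                row i of txt (a ground-truth image-text pair).
    labels    : (N,) person identity IDs, used to exclude same-identity
                pairs from negative mining.
    """
    sim = img @ txt.T                          # cosine similarity matrix (N, N)
    pos = np.diag(sim)                         # similarity of matched pairs
    same_id = labels[:, None] == labels[None, :]
    neg_sim = np.where(same_id, -np.inf, sim)  # mask out same-identity pairs
    hard_i2t = neg_sim.max(axis=1)             # hardest negative text per image
    hard_t2i = neg_sim.max(axis=0)             # hardest negative image per text
    # Symmetric hinge terms: push matched pairs above hard negatives by margin.
    loss_i2t = np.maximum(0.0, margin - pos + hard_i2t)
    loss_t2i = np.maximum(0.0, margin - pos + hard_t2i)
    return (loss_i2t + loss_t2i).mean()
```

Mining the hardest in-batch negative (the closest non-matching sample) concentrates the gradient on exactly the minor differences the abstract highlights; a production version would typically operate on framework tensors so the loss is differentiable.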