The goal of text-to-image person retrieval is to retrieve, from a large gallery, the person images that match a given textual description. The main challenge lies in the significant difference in how the visual and textual modalities represent information: text conveys abstract and precise information through vocabulary and grammatical structure, whereas images convey concrete and intuitive information. To fully leverage the expressive power of text, abstract textual descriptions must be mapped accurately to specific images. To address this issue, we propose a novel framework to Unleash the Imagination of Text (UIT) in text-to-image person retrieval, aiming to fully explore the power of the words in a sentence. Specifically, the framework employs the full pre-trained CLIP model as a dual encoder for images and texts, taking advantage of its prior cross-modal alignment knowledge. We propose a Text-guided Image Restoration auxiliary task that implicitly maps abstract textual entities to specific image regions, facilitating alignment between textual and visual embeddings. Additionally, we introduce a cross-modal triplet loss tailored to hard samples, enhancing the model's ability to distinguish minor differences. To focus the model on the key components of a sentence, we propose a novel text data augmentation technique. Our proposed methods achieve state-of-the-art results on three popular benchmark datasets, and the source code will be made publicly available shortly.
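The abstract does not give the exact form of the cross-modal triplet loss, so the following is only a minimal sketch of one common variant: a margin-based triplet loss over a batch of matched image-text pairs that uses the hardest in-batch negative in each retrieval direction. The function names, the `margin` value, and the use of cosine similarity are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def hard_triplet_loss(image_emb, text_emb, person_ids, margin=0.2):
    """Hypothetical hardest-negative triplet loss (illustrative only).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    belongs to the same matched image-text pair.
    person_ids: identity label of each pair; samples sharing an identity
    are never treated as negatives of each other.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    sim = img @ txt.T                      # sim[i, j] = cos(image i, text j)

    ids = np.asarray(person_ids)
    same_identity = ids[:, None] == ids[None, :]

    pos = np.diag(sim)                     # similarity of each matched pair
    neg = np.where(same_identity, -np.inf, sim)

    hardest_text = neg.max(axis=1)         # hardest negative text per image
    hardest_image = neg.max(axis=0)        # hardest negative image per text

    loss_i2t = np.maximum(0.0, margin + hardest_text - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_image - pos)
    return float((loss_i2t + loss_t2i).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    texts = images + 0.1 * rng.normal(size=(4, 8))  # roughly aligned pairs
    print(hard_triplet_loss(images, texts, person_ids=[0, 1, 2, 3]))
```

Selecting only the hardest negative concentrates the gradient on the most confusable pairs, which is one plausible reading of "tailored to hard samples"; averaging over all negatives or weighting them softly would be equally reasonable alternatives.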