Searching for a specific person has significant social and security value, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fail to exploit both types of information effectively, which reduces accuracy. In this paper, a new task called Composed Person Retrieval (CPR) is proposed to jointly utilize image and text information for target person retrieval. However, supervised CPR requires a costly manually annotated dataset, and no such resource is currently available. To mitigate this issue, we first introduce Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to solve the CPR problem without expensive annotations. Second, to learn a ZS-CPR model, we propose a two-stage learning framework, Word4Per, in which a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model built on a fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without using any CPR data. Third, a finely annotated Image-Text Composed Person Retrieval (ITCPR) dataset is built as a benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10\%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
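For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Rank-1 and mAP are commonly computed for a retrieval task. It assumes each query comes with a gallery ranking sorted by descending similarity and a set of ground-truth matches. The function name `rank1_and_map` and the query and gallery ids are hypothetical, and the exact evaluation protocol used for ITCPR may differ.

```python
from typing import Dict, List, Set, Tuple


def rank1_and_map(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Return (Rank-1, mAP) over all queries.

    rankings: query id -> gallery ids sorted by descending similarity.
    relevant: query id -> gallery ids that show the target person.
    """
    rank1_hits = 0
    ap_sum = 0.0
    for qid, ranked in rankings.items():
        targets = relevant[qid]
        # Rank-1: the top-ranked gallery item is a correct match.
        if ranked and ranked[0] in targets:
            rank1_hits += 1
        # Average precision: mean of precision@k over the positions k
        # at which a correct match appears.
        hits = 0
        precision_sum = 0.0
        for position, gid in enumerate(ranked, start=1):
            if gid in targets:
                hits += 1
                precision_sum += hits / position
        if targets:
            ap_sum += precision_sum / len(targets)
    num_queries = len(rankings)
    return rank1_hits / num_queries, ap_sum / num_queries


# Hypothetical toy data: two composed queries over a three-image gallery.
rankings = {"q1": ["g3", "g1", "g2"], "q2": ["g2", "g3", "g1"]}
relevant = {"q1": {"g3"}, "q2": {"g3", "g1"}}
rank1, mean_ap = rank1_and_map(rankings, relevant)
print(f"Rank-1 = {rank1:.3f}, mAP = {mean_ap:.3f}")
```

In this toy case only `q1` has a correct match at the top of its ranking, so Rank-1 is 0.5, while mAP also credits `q2` for placing its two matches at positions 2 and 3.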