Text-based person search (TBPS) aims to retrieve images of a target person from a large image gallery given a natural language description. Existing methods predominantly train models on parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data ($\mu$-TBPS), in which only non-parallel images and texts, or even image-only data, can be used. To this end, we propose a two-stage framework, generation-then-retrieval (GTR), which first generates a pseudo text for each image and then performs retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of each person image: it first uses a set of instruction prompts to activate an off-the-shelf pretrained vision-language model to capture and generate fine-grained person attributes, and then converts the extracted attributes into a textual description via a fine-tuned large language model or a hand-crafted template. In the retrieval stage, because noise in the generated texts can interfere with model training, we develop a confidence score-based training scheme in which more reliable texts contribute more during training. Experimental results on multiple TBPS benchmarks (i.e., CUHK-PEDES, ICFG-PEDES and RSTPReid) show that the proposed GTR achieves promising performance without relying on parallel image-text data.
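To make the idea of confidence score-based training concrete, the following is a minimal sketch and not the method of the paper. The abstract does not specify how confidence scores are computed or which retrieval loss is used, so the sketch assumes an image-to-text contrastive (InfoNCE-style) loss and assumes each pseudo text comes with a non-negative confidence score. The names `per_pair_contrastive_loss` and `confidence_weighted_loss`, the `temperature` value, and the toy similarity matrix are all hypothetical.

```python
import math


def per_pair_contrastive_loss(sim, temperature=0.07):
    """Image-to-text contrastive loss for each image (assumed loss, not from the paper).

    sim[i][j] is the similarity between image i and pseudo text j;
    the pseudo text generated for image i is at index i.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return losses


def confidence_weighted_loss(sim, confidence, temperature=0.07):
    """Average the per-pair losses, weighting each pair by its confidence score.

    Pairs whose pseudo text is judged more reliable (higher confidence)
    contribute more to the training signal. How the confidence is obtained
    is left open here; it is an assumption of this sketch.
    """
    if len(confidence) != len(sim):
        raise ValueError("need one confidence score per image-text pair")
    total = sum(confidence)
    if total <= 0:
        raise ValueError("confidence scores must sum to a positive value")
    losses = per_pair_contrastive_loss(sim, temperature)
    return sum(c * l for c, l in zip(confidence, losses)) / total


# Toy batch: the pseudo text of image 1 matches poorly (a noisy caption).
toy_sim = [[0.9, 0.1],
           [0.8, 0.2]]
uniform = confidence_weighted_loss(toy_sim, [1.0, 1.0])
weighted = confidence_weighted_loss(toy_sim, [1.0, 0.1])
```

In this toy batch, lowering the confidence of the second pair reduces its share of the loss, so `weighted` is smaller than `uniform`; with equal confidence scores the function reduces to the ordinary batch mean.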