In this paper, we introduce MALS, a large Multi-Attribute and Language Search dataset for text-based person retrieval, and explore the feasibility of pre-training on attribute recognition and image-text matching jointly. MALS contains 1,510,330 image-text pairs, about 37.5 times the size of the prevailing CUHK-PEDES, and every image is annotated with 27 attributes. To address privacy concerns and annotation costs, we generate the dataset with off-the-shelf diffusion models. To verify that learning from generated data is feasible, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) Attribute prompt learning uses attribute prompts for image-attribute alignment, which enhances text matching learning. (2) Text matching learning facilitates representation learning on fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS: APTM achieves state-of-the-art retrieval performance on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.60%, +7.39%, and +15.90% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
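For readers unfamiliar with the Recall@1 metric reported above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image person retrieval. It is not taken from the APTM code: the function name `recall_at_k`, the similarity matrix, and the toy identity labels are all hypothetical, and the sketch assumes a query counts as a hit when any of its top-K ranked gallery images shares the query's person identity.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of text queries whose top-k ranked images contain the right identity.

    similarity[i][j] is an (assumed) matching score between text query i and
    gallery image j; higher means more similar.
    """
    if len(similarity) != len(query_ids):
        raise ValueError("one row of scores is required per query")
    hits = 0
    for scores, query_id in zip(similarity, query_ids):
        # Rank gallery indices by descending score and keep the first k.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(gallery_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy example (hypothetical scores): three text queries, three gallery images.
similarity = [
    [0.9, 0.2, 0.1],  # query for identity 7: best match is image 0 (identity 7)
    [0.3, 0.4, 0.8],  # query for identity 8: best match is image 2 (identity 9)
    [0.1, 0.7, 0.6],  # query for identity 9: best match is image 1 (identity 8)
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9]

r1 = recall_at_k(similarity, query_ids, gallery_ids, k=1)
r2 = recall_at_k(similarity, query_ids, gallery_ids, k=2)
```

In this toy setup only the first query retrieves the correct identity at rank 1, while all three succeed within the top 2, so Recall@1 is one third and Recall@2 is 1.0.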