In this paper, we introduce MALS, a large Multi-Attribute and Language Search dataset for text-based person retrieval, and explore the feasibility of pre-training on attribute recognition and image-text matching jointly. MALS contains 1,510,330 image-text pairs, about 37.5 times larger than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes. Given privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) Attribute prompt learning uses attribute prompts for image-attribute alignment, which enhances text matching learning. (2) Text matching learning facilitates representation learning on fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS: APTM achieves state-of-the-art retrieval performance on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.96%, +7.68%, and +16.95% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
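For readers unfamiliar with the Recall@1 metric quoted in the abstract, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image person retrieval. It is not the authors' evaluation code. It assumes that a query counts as a hit when any of its top-K ranked gallery images shares the query's person identity. The function name `recall_at_k`, the similarity scores, and the identity labels are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of text queries whose top-k images include the queried identity.

    similarity[i][j] is the score between text query i and gallery image j
    (higher means a better match). This hit criterion is an assumption.
    """
    if not query_ids:
        raise ValueError("at least one query is required")
    if len(similarity) != len(query_ids):
        raise ValueError("one similarity row is required per query")
    hits = 0
    for row, qid in zip(similarity, query_ids):
        if len(row) != len(gallery_ids):
            raise ValueError("each row needs one score per gallery image")
        # Gallery indices sorted by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # Hit if any of the top-k images shares the query's identity.
        if any(gallery_ids[j] == qid for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy example: 3 text queries against 4 gallery images (made-up scores).
scores = [
    [0.9, 0.1, 0.3, 0.2],  # query 0, identity 7: best match is image 0
    [0.2, 0.4, 0.8, 0.1],  # query 1, identity 8: best match is image 2
    [0.1, 0.2, 0.3, 0.7],  # query 2, identity 9: best match is image 3
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9, 9]

r1 = recall_at_k(scores, query_ids, gallery_ids, k=1)
r2 = recall_at_k(scores, query_ids, gallery_ids, k=2)
print(f"Recall@1 = {r1:.4f}, Recall@2 = {r2:.4f}")
```

In this toy case, query 1 retrieves an image of a different identity at rank 1 and the correct identity at rank 2, so it counts as a miss for Recall@1 and a hit for Recall@2.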