Text-based Person Search (TPS) aims to retrieve pedestrians that match a text description rather than a query image. Recent Vision-Language Pre-training (VLP) models bring transferable knowledge to downstream TPS tasks, yielding performance gains more efficiently. However, existing VLP-enhanced TPS methods use only the pre-trained visual encoder; they neglect the corresponding textual representation and break the important modality alignment learned during large-scale pre-training. In this paper, we explore how to fully exploit the textual potential of VLP in TPS tasks. We build on our proposed VLP-TPS baseline, the first TPS model in which both modalities are pre-trained. We propose Multi-Integrity Description Constraints (MIDC), which enhance the robustness of the textual modality by incorporating different components of the fine-grained corpus during training. Inspired by the prompt approach used for zero-shot classification with VLP models, we propose the Dynamic Attribute Prompt (DAP), which provides a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our TPS framework achieves state-of-the-art performance, significantly outperforming the previous best method.
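To make the retrieval setting concrete, the sketch below ranks a gallery of pedestrian images against a text query by cosine similarity in a shared embedding space. It is only an illustration of the general TPS setup, not the VLP-TPS, MIDC, or DAP implementation: the function names `cosine` and `rank_gallery` are hypothetical, and the embeddings are hand-written toy vectors standing in for the outputs of pre-trained text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_gallery(text_emb, gallery):
    """Return gallery image ids sorted from most to least similar to the text query.

    gallery maps an image id to its embedding. In a real system both the text
    and image embeddings would come from encoders; here they are assumed given.
    """
    scores = {img_id: cosine(text_emb, emb) for img_id, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed, for illustration only).
text_emb = [1.0, 0.0]
gallery = {
    "img_a": [0.0, 1.0],  # orthogonal to the query
    "img_b": [2.0, 0.1],  # nearly parallel to the query
    "img_c": [1.0, 1.0],  # 45 degrees from the query
}
ranking = rank_gallery(text_emb, gallery)
print(ranking)
```

Ranking by similarity in a shared space only works when the text and image encoders are aligned, which is why discarding the pre-trained text encoder, as the abstract notes, can undo what large-scale pre-training learned.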