Text-based Person Retrieval (TPR) aims to retrieve images of a target person given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially given the scarcity of large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TPR. Specifically, to exploit CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module built from text-to-image and image-to-text bidirectional prompts and coupling projections. Second, we design a Dual Adapters Transferring (DAT) module to transfer knowledge on the output side of Multi-Head Attention (MHA) in both the vision and language branches. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms state-of-the-art approaches on three benchmark datasets while its trainable parameters account for only 7.4% of the entire model, demonstrating its efficiency, effectiveness and generalization.
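To make the two components more concrete, the following is a minimal, dependency-free sketch of one plausible reading of the abstract; it is not the authors' implementation. It assumes that DAT can be approximated by a residual bottleneck adapter applied to the MHA output, and that BPT pairs learnable prompts on each side with a linear coupling projection into the other modality. The class names `BottleneckAdapter` and `CoupledPrompts`, the zero initialization, the ReLU bottleneck and all dimensions are illustrative assumptions.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one token vector x (weight is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random Gaussian initialization for a rows x cols weight matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical adapter on the MHA output: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.down_w = init_matrix(bottleneck, dim, rng)
        self.down_b = [0.0] * bottleneck
        # Assumption: the up-projection starts at zero, so the adapter is an
        # identity map at initialization and the frozen backbone is undisturbed.
        self.up_w = [[0.0] * bottleneck for _ in range(dim)]
        self.up_b = [0.0] * dim

    def __call__(self, x):
        hidden = [max(0.0, h) for h in linear(x, self.down_w, self.down_b)]
        delta = linear(hidden, self.up_w, self.up_b)
        return [xi + di for xi, di in zip(x, delta)]


class CoupledPrompts:
    """Hypothetical bidirectional prompts with coupling projections.

    Each modality owns learnable prompts and also receives the other
    modality's prompts after a linear projection into its own width.
    """

    def __init__(self, n_prompts, text_dim, image_dim, rng):
        self.text_prompts = init_matrix(n_prompts, text_dim, rng)
        self.image_prompts = init_matrix(n_prompts, image_dim, rng)
        # Coupling projections: text -> image width and image -> text width.
        self.t2i_w = init_matrix(image_dim, text_dim, rng)
        self.t2i_b = [0.0] * image_dim
        self.i2t_w = init_matrix(text_dim, image_dim, rng)
        self.i2t_b = [0.0] * text_dim

    def image_side(self):
        """Prompts to prepend to the image tokens."""
        projected = [linear(p, self.t2i_w, self.t2i_b) for p in self.text_prompts]
        return self.image_prompts + projected

    def text_side(self):
        """Prompts to prepend to the text tokens."""
        projected = [linear(p, self.i2t_w, self.i2t_b) for p in self.image_prompts]
        return self.text_prompts + projected


if __name__ == "__main__":
    rng = random.Random(0)
    adapter = BottleneckAdapter(dim=8, bottleneck=2, rng=rng)
    token = [float(i) for i in range(8)]
    print(adapter(token) == token)  # identity at initialization

    prompts = CoupledPrompts(n_prompts=2, text_dim=4, image_dim=6, rng=rng)
    print(len(prompts.image_side()), len(prompts.text_side()))
```

In this sketch only the adapter and prompt weights would be trained while the CLIP backbone stays frozen, which is the general idea behind a small trainable-parameter share; the sketch itself says nothing about how the reported 7.4% figure is obtained.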