This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pre-training (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work, CLIP-ReID, to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigation, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP with a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate that our method is competitive with CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where it achieves state-of-the-art performance. Code is available at https://github.com/RikoLi/PCL-CLIP.
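To make the core idea concrete, the following is a minimal sketch of a prototypical contrastive loss of the kind the abstract describes: each L2-normalized image feature is scored against one prototype per identity via temperature-scaled cosine similarity, and a cross-entropy over those scores pulls the feature toward its own identity's prototype. This is an illustrative reimplementation in numpy under our own assumptions (function name `pcl_loss`, temperature value, and prototype construction are ours), not the authors' exact formulation.

```python
import numpy as np

def pcl_loss(features, labels, prototypes, temperature=0.07):
    """Prototypical contrastive loss (illustrative sketch).

    features:   (N, D) L2-normalized image embeddings
    labels:     (N,) integer identity labels indexing into prototypes
    prototypes: (C, D) L2-normalized per-identity prototype vectors
    """
    # Temperature-scaled cosine similarity between each feature
    # and every identity prototype.
    logits = features @ prototypes.T / temperature  # (N, C)
    # Numerically stable log-softmax over the prototype axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy: negative log-probability of the true identity.
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: 3 identities with random unit-norm prototypes.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 8))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
labels = np.array([0, 1, 2])

# Features perfectly aligned with their prototypes give a lower
# loss than features matched to the wrong prototypes.
aligned_loss = pcl_loss(prototypes, labels, prototypes)
shuffled_loss = pcl_loss(prototypes[[1, 2, 0]], labels, prototypes)
```

In practice the prototypes would be class centroids of the encoder's features (refreshed as training proceeds), and the loss gradient would flow only into CLIP's image encoder, matching the abstract's "directly fine-tunes the image encoder" setup.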