We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that extends ReID to diverse scenarios (e.g., cross-resolution, clothing change) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM), a three-stage framework that effectively exploits the representational power of vision-language models. We start from a pre-trained CLIP model comprising an image encoder and a text encoder. In Stage I, we introduce a scenario embedding into the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learnable text embeddings to associate them with the pseudo-labels produced in Stage I, and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario; we then propose a dynamic text representation update strategy to maintain consistency between the text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM: it not only outperforms existing scenario-specific methods but also improves overall performance by integrating knowledge from multiple scenarios.
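To make the Stage II objective concrete, the following is a minimal PyTorch sketch of one plausible form of the multi-scenario separation loss. The function name `multi_scenario_separation_loss`, the centroid-based cosine formulation, and all tensor shapes are illustrative assumptions, not the paper's exact definition.

```python
# A minimal sketch (an assumption, not the paper's exact formulation) of a
# multi-scenario separation loss: it pushes apart the mean text representations
# of different scenarios by penalizing their pairwise cosine similarity.
import torch
import torch.nn.functional as F

def multi_scenario_separation_loss(text_feats: torch.Tensor,
                                   scenario_ids: torch.Tensor) -> torch.Tensor:
    """text_feats: (N, D) learned text embeddings (one per identity prompt);
    scenario_ids: (N,) integer scenario label of each embedding."""
    scenarios = scenario_ids.unique()
    # Per-scenario centroid of the text embeddings, L2-normalized
    centroids = torch.stack([
        F.normalize(text_feats[scenario_ids == s].mean(dim=0), dim=0)
        for s in scenarios
    ])                                            # (S, D)
    sim = centroids @ centroids.t()               # pairwise cosine similarity (S, S)
    mask = ~torch.eye(len(scenarios), dtype=torch.bool, device=sim.device)
    # Minimizing the off-diagonal similarity increases inter-scenario divergence
    return sim[mask].mean()

# Hypothetical usage: 3 scenarios, 12 identity-level text embeddings of dim 512
feats = torch.randn(12, 512)
ids = torch.randint(0, 3, (12,))
loss = multi_scenario_separation_loss(feats, ids)
```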