With rich temporal-spatial information, video-based person re-identification methods hold great promise. Although tracklets can be easily obtained with off-the-shelf tracking models, annotating identities remains expensive and often impractical. Therefore, some video-based methods propose using only a few identity annotations or camera labels to facilitate feature learning. These methods also simply average the frame features of each tracklet, overlooking unexpected variations and the inherent identity consistency within tracklets. In this paper, we propose the Self-Supervised Refined Clustering (SSR-C) framework, which relies on no annotations or auxiliary information, to promote unsupervised video person re-identification. Specifically, we first propose the Noise-Filtered Tracklet Partition (NFTP) module to reduce the feature bias of tracklets caused by noisy tracking results, and sequentially partition the noise-filtered tracklets into "sub-tracklets". Then, we cluster and further merge sub-tracklets using the self-supervised signal from tracklet partition, which is enhanced through a progressive strategy to generate reliable pseudo labels, facilitating intra-class cross-tracklet aggregation. Moreover, we propose the Class Smoothing Classification (CSC) loss to efficiently promote model learning. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate that our proposed SSR-C achieves state-of-the-art results for unsupervised video person re-identification and is comparable to advanced supervised methods.
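The pipeline the abstract outlines (filter noisy frames, partition each tracklet into sub-tracklets, cluster sub-tracklet features, then merge clusters using the tracklet-of-origin as a self-supervised signal) can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: the noise filter here simply drops frames far from the tracklet's mean feature, the clusterer is plain agglomerative clustering, and the function and parameter names (`partition_tracklet`, `cluster_and_merge`, `drop_ratio`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def partition_tracklet(frames, num_parts=2, drop_ratio=0.1):
    """Toy NFTP stand-in: drop the frames farthest from the tracklet's
    mean feature (noise filtering), then split the rest sequentially
    into sub-tracklets."""
    mean = frames.mean(axis=0)
    dists = np.linalg.norm(frames - mean, axis=1)
    n_keep = max(1, int(len(frames) * (1 - drop_ratio)))
    keep = np.sort(np.argsort(dists)[:n_keep])  # keep temporal order
    return np.array_split(frames[keep], num_parts)

def cluster_and_merge(sub_feats, parent_ids, n_clusters):
    """Cluster sub-tracklet features, then merge clusters that contain
    sub-tracklets from the same parent tracklet -- the self-supervised
    signal that sub-tracklets of one tracklet share an identity."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(sub_feats)
    parent = list(range(n_clusters))  # union-find over cluster labels
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for t in set(parent_ids):
        members = [labels[i] for i in range(len(labels)) if parent_ids[i] == t]
        for c in members[1:]:
            parent[find(c)] = find(members[0])
    merged = [find(c) for c in labels]
    remap = {c: i for i, c in enumerate(dict.fromkeys(merged))}
    return [remap[c] for c in merged]  # contiguous pseudo labels

# Synthetic demo: two identities, one tracklet each, 10 frames of
# 8-dim features per tracklet.
rng = np.random.default_rng(0)
tracklets = [rng.normal(0.0, 0.1, (10, 8)), rng.normal(5.0, 0.1, (10, 8))]

sub_feats, parent_ids = [], []
for t, frames in enumerate(tracklets):
    for part in partition_tracklet(frames):
        sub_feats.append(part.mean(axis=0))  # average frames per sub-tracklet
        parent_ids.append(t)

pseudo = cluster_and_merge(np.stack(sub_feats), parent_ids, n_clusters=4)
```

Sub-tracklets originating from the same tracklet end up with the same pseudo label even when the clusterer initially splits them, which is the intra-class cross-tracklet aggregation the abstract refers to.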