Visible-infrared person re-identification is challenging due to the large modality gap. To bridge this gap, most studies rely heavily on correlations between holistic visible and infrared person images, an approach that may perform poorly under severe distribution shifts. In contrast, we find that some cross-modal correlated high-frequency components contain discriminative visual patterns and are less affected than holistic images by variations such as wavelength, pose, and background clutter. We are therefore motivated to bridge the modality gap using such high-frequency components, and propose \textbf{Proto}type-guided \textbf{H}igh-frequency \textbf{P}atch \textbf{E}nhancement (ProtoHPE) with two core designs. \textbf{First}, to enhance the representation ability of cross-modal correlated high-frequency components, we split off the patches containing such components using the wavelet transform and an exponential moving average (EMA) Vision Transformer (ViT), then enable the ViT to take the split patches as auxiliary input. \textbf{Second}, to obtain semantically compact and discriminative high-frequency representations of the same identity, we propose Multimodal Prototypical Contrast. Specifically, it hierarchically captures the comprehensive semantics of instances from different modalities, facilitating the aggregation of high-frequency representations belonging to the same identity. With this design, the ViT can capture key high-frequency components during inference without relying on ProtoHPE, thus adding no extra complexity. Extensive experiments validate the effectiveness of ProtoHPE.
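To make the first design more concrete, the following is a minimal sketch of how patches rich in high-frequency content could be ranked with a wavelet transform. It is an illustration under stated assumptions, not the procedure used by ProtoHPE: the abstract does not specify the wavelet, the patch size, or the selection rule, so the single-level Haar transform, the 16-pixel patch size, the top-$k$ rule, and the function names \texttt{haar\_detail\_energy} and \texttt{top\_k\_high\_freq\_patches} are all hypothetical. The sketch also omits the EMA ViT, which the abstract uses alongside the wavelet transform to identify cross-modal correlated patches.

```python
import numpy as np


def haar_detail_energy(img: np.ndarray) -> np.ndarray:
    """Energy of the three detail sub-bands of a single-level 2D Haar transform.

    Assumption: Haar is used only because it is the simplest wavelet;
    the abstract does not name one. Output has half the input resolution.
    """
    a = img[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    row_detail = (a + b - c - d) / 2.0   # change between the two rows
    col_detail = (a - b + c - d) / 2.0   # change between the two columns
    diag_detail = (a - b - c + d) / 2.0  # diagonal change
    return row_detail**2 + col_detail**2 + diag_detail**2


def top_k_high_freq_patches(img: np.ndarray, patch: int = 16, k: int = 4):
    """Rank non-overlapping patches by summed Haar detail energy.

    Returns the (row, col) grid indices of the k highest-scoring patches
    and the full score grid. `patch` and `k` are illustrative defaults.
    """
    h, w = img.shape
    if patch % 2 or h % patch or w % patch:
        raise ValueError("patch must be even and divide both image sides")
    energy = haar_detail_energy(img.astype(np.float64))
    p = patch // 2                  # patch side on the half-resolution grid
    gh, gw = h // patch, w // patch  # number of patches per side
    scores = energy.reshape(gh, p, gw, p).sum(axis=(1, 3))
    order = np.argsort(scores.ravel())[::-1][:k]
    return [(int(i // gw), int(i % gw)) for i in order], scores
```

For a grayscale image split into a ViT-style patch grid, \texttt{top\_k\_high\_freq\_patches(img, patch=16, k=4)} returns the grid positions of the four patches with the most detail energy. In a setting like the one the abstract describes, such positions would only be candidates, since the selection there also depends on cross-modal correlation.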