Single-modal object re-identification (ReID) struggles to remain robust in complex visual scenarios. In contrast, multi-modal object ReID exploits complementary information from diverse modalities and shows great potential for practical applications. However, previous methods are easily affected by irrelevant backgrounds and usually ignore modality gaps. To address these issues, we propose a novel learning framework named \textbf{EDITOR} that selects diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from the different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens using both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). Both are formulated as new loss functions that improve feature discrimination through background suppression. As a result, our framework generates more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our method. The code is available at https://github.com/924973292/EDITOR.
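To make the idea of selecting object-centric tokens from spatial and frequency cues more concrete, the following is a minimal NumPy sketch. It is not the released implementation: the function names, the `keep_ratio` and `cutoff` parameters, the use of class-token attention as the spatial score, and the use of an FFT high-pass response as the frequency score are all assumptions made for illustration.

```python
import numpy as np


def topk_mask(scores, keep_ratio):
    """Boolean mask that keeps the ceil(keep_ratio * N) highest-scoring tokens."""
    n = scores.shape[0]
    k = max(1, int(np.ceil(keep_ratio * n)))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask


def spatial_scores(cls_attention):
    """Assumed spatial cue: class-token attention to each patch, averaged over heads.

    cls_attention has shape (heads, N).
    """
    return cls_attention.mean(axis=0)


def frequency_scores(tokens, grid_hw, cutoff=0.1):
    """Assumed frequency cue: per-token energy after a 2D FFT high-pass filter.

    tokens has shape (N, D) with N == H * W for grid_hw == (H, W).
    """
    h, w = grid_hw
    grid = tokens.reshape(h, w, -1)
    spectrum = np.fft.fft2(grid, axes=(0, 1))
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    # Keep only frequency bins whose radius exceeds the cutoff (cycles per token).
    high_pass = np.sqrt(fy**2 + fx**2) > cutoff
    filtered = np.fft.ifft2(spectrum * high_pass[..., None], axes=(0, 1)).real
    return np.linalg.norm(filtered, axis=-1).reshape(-1)


def select_tokens(tokens, cls_attention, grid_hw, keep_ratio=0.25, cutoff=0.1):
    """Union of the spatial and frequency top-k masks (hypothetical selection rule)."""
    mask = topk_mask(spatial_scores(cls_attention), keep_ratio)
    mask |= topk_mask(frequency_scores(tokens, grid_hw, cutoff), keep_ratio)
    return tokens[mask], mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))         # 4x4 patch grid, 8-dim features
    cls_attention = rng.random(size=(2, 16))  # 2 heads
    selected, mask = select_tokens(tokens, cls_attention, grid_hw=(4, 4))
    print(selected.shape, int(mask.sum()))
```

In this sketch, tokens outside the returned mask play the role of background tokens. Taking the union of the two masks is one simple way to combine the cues; an intersection or a learned weighting would be equally plausible choices.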