Multimodal Re-Identification (ReID) is a popular retrieval task that aims to re-identify objects across diverse data streams, prompting many researchers to integrate multiple modalities into a unified representation. While such fusion promises a holistic view, our investigation reveals potential pitfalls. We find that prevailing late-fusion techniques often produce suboptimal latent representations compared to methods that train modalities in isolation. We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion, a phenomenon others have termed modality laziness. We present the nuanced view that this relaxation can cause certain modalities to fail to fully exploit available task-relevant information, yet it also shields noisy modalities from overfitting to task-irrelevant data. Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembles of unimodal backbones, when paired with best-known training techniques, exceed the current state-of-the-art performance on several multimodal ReID benchmarks. By exposing the double-edged sword of "modality laziness", we motivate future research on balancing local modality strengths with global representations.
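To make the contrast concrete, the following is a minimal sketch of the idea behind unimodal concatenation as the abstract describes it: each modality is supervised only by its own objective, and the unimodal embeddings are joined only for retrieval. It is an illustration under stated assumptions, not the authors' implementation. The function names, the modality names `rgb` and `nir`, the choice of softmax cross-entropy as the per-modality loss, and the L2 normalization before concatenation are all hypothetical choices made for the example; a real pipeline would use a tensor library and learned backbones.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def unimodal_losses(logits_by_modality, label):
    """Return one loss per modality, each from that modality's own classifier head.

    No loss is computed on a fused representation, so no modality can lean on
    another one to lower its objective (assumed reading of "trained in isolation").
    """
    return {
        name: cross_entropy(logits, label)
        for name, logits in logits_by_modality.items()
    }


def unicat_embed(embeddings_by_modality, order):
    """Concatenate L2-normalized unimodal embeddings in a fixed modality order."""
    fused = []
    for name in order:
        fused.extend(l2_normalize(embeddings_by_modality[name]))
    return fused


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -scores[i])


if __name__ == "__main__":
    # Hypothetical two-modality sample with identity label 0.
    losses = unimodal_losses({"rgb": [0.0, 0.0], "nir": [2.0, 0.0]}, label=0)
    print(losses)  # each modality keeps its own training signal

    query = unicat_embed({"rgb": [3.0, 4.0], "nir": [0.0, 2.0]}, ["rgb", "nir"])
    gallery = [[0.0, 1.0, 1.0, 0.0], [0.6, 0.8, 0.0, 1.0]]
    print(rank_gallery(query, gallery))
```

Normalizing each modality before concatenation keeps any single modality from dominating the cosine score purely through embedding scale; whether that matches the published method is an assumption of this sketch.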