Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data

Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between the V and I modalities, especially under real-world conditions, where images are corrupted by, e.g., blur, noise, and weather. Indeed, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID, named Multimodal Middle Stream Fusion (MMSF), that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing them to dynamically balance the importance of each modality. Recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate which multimodal V-I ReID models are more likely to perform well in real-world operational conditions. In particular, our ML-MDA is an important strategy for a V-I person ReID system to sustain high accuracy and robustness when processing corrupted multimodal images. Also, our multimodal ReID model MMSF outperforms every compared method under both CL and NCL camera scenarios.
