In real-world scenarios, using multiple modalities such as visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, relying on multiple modality-specific encoders and a fusion module. In this paper, we explore a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting has a lower memory footprint and is better suited to applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when training a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. To this end, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality-agnostic module, to learn a common representation of both modalities. Our experiments show that MiPa can learn a representation that reaches competitive results on traditional RGB/IR benchmarks while requiring only a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
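The abstract only names the patch-mixing idea; as a rough illustration, the sketch below shows what patch-wise mixing of two modalities could look like in PyTorch. It assumes patch-aligned embeddings from the same scene and a hypothetical sampling ratio `rho`; this is not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of patch-wise modality mixing in the spirit of MiPa.
# Assumptions (not from the paper): both modalities are already split
# into per-patch embeddings of identical shape, and each patch is drawn
# from IR with probability `rho`, otherwise from RGB.
import torch

def mix_patches(rgb_patches: torch.Tensor,
                ir_patches: torch.Tensor,
                rho: float = 0.5) -> torch.Tensor:
    """Mix patch embeddings from two modalities.

    rgb_patches, ir_patches: (batch, num_patches, embed_dim) tensors
    from the same scene, patch-aligned.
    """
    assert rgb_patches.shape == ir_patches.shape
    b, n, _ = rgb_patches.shape
    # One Bernoulli draw per patch decides the source modality.
    take_ir = torch.rand(b, n, 1, device=rgb_patches.device) < rho
    return torch.where(take_ir, ir_patches, rgb_patches)

# Usage: the mixed patch sequence feeds a single shared vision encoder,
# so at inference time either modality alone remains a valid input.
rgb = torch.randn(2, 196, 768)
ir = torch.randn(2, 196, 768)
mixed = mix_patches(rgb, ir, rho=0.5)
```

Because the encoder sees patches from both modalities within every training sample, neither modality's statistics fully dominate the learned representation, which is the imbalance the abstract describes.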