In this paper, we present a different way to use two modalities, in which a single model sees either one modality or the other. This can be useful when adapting a unimodal model to leverage more information while respecting a limited computational budget, and it yields a single model that is able to handle any of the modalities. We coin the term anymodal learning to describe this setting. For example, when surveilling a room, an infrared modality is much more valuable when the lights are off, whereas a visible one provides more discriminative information when the lights are on. This work investigates how to efficiently leverage visible and infrared/thermal modalities in a transformer-based object detection backbone to create an anymodal architecture. Our method adds no inference overhead at test time while exploring an effective way to exploit both modalities during training. To this end, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module that is responsible for learning the best way to find a common representation of both modalities. On three different visible-infrared object detection datasets, this approach balances the modalities, reaching results on individual-modality benchmarks that are competitive with those of a unimodal architecture. Finally, when used as a regularization for the strongest modality, our method can outperform multimodal fusion methods while requiring only a single modality during inference. Notably, MiPa achieved state-of-the-art results on the LLVIP visible/infrared benchmark. Code: https://github.com/heitorrapela/MiPa
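The abstract does not spell out how patches are mixed, so the following is only a minimal sketch of one plausible reading of Mixed Patches: each patch of a training image is drawn at random from either the visible or the infrared image of an aligned pair. The function name `mix_patches`, the `ratio` parameter, and the returned mask are assumptions made for illustration, not the authors' implementation; the linked repository is the authoritative reference.

```python
import random


def mix_patches(visible, infrared, patch, ratio=0.5, rng=None):
    """Build one image by taking each square patch from either modality.

    Illustrative assumption, not the paper's code:
    visible, infrared -- spatially aligned 2D grids (lists of rows).
    patch             -- patch side length; must divide height and width.
    ratio             -- probability that a patch comes from `visible`.
    """
    rng = rng or random.Random()
    height, width = len(visible), len(visible[0])
    same_shape = len(infrared) == height and all(
        len(row) == width for row in list(visible) + list(infrared)
    )
    if not same_shape:
        raise ValueError("modalities must be spatially aligned")
    if height % patch or width % patch:
        raise ValueError("patch size must divide the image size")

    mixed = [[None] * width for _ in range(height)]
    mask = []  # mask[i][j] is True when patch (i, j) came from `visible`
    for top in range(0, height, patch):
        mask_row = []
        for left in range(0, width, patch):
            use_visible = rng.random() < ratio
            source = visible if use_visible else infrared
            # Copy the whole patch, row by row, from the chosen modality.
            for y in range(top, top + patch):
                mixed[y][left:left + patch] = source[y][left:left + patch]
            mask_row.append(use_visible)
        mask.append(mask_row)
    return mixed, mask


# Toy aligned pair: a 4x4 "visible" image of ones and "infrared" of zeros.
vis = [[1] * 4 for _ in range(4)]
ir = [[0] * 4 for _ in range(4)]
mixed, mask = mix_patches(vis, ir, patch=2, ratio=0.5, rng=random.Random(0))
```

Under this reading, the mixed image has the same shape as a single-modality input, which is consistent with the abstract's claim that the approach adds no overhead at inference: at test time the model would simply receive an unmixed image from one modality.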