Multimodal learning, particularly for pedestrian detection, has recently received attention because it can perform consistently across several critical autonomous driving scenarios, such as low-light, night-time, and adverse weather conditions. In most cases, however, the training distribution emphasizes the contribution of one specific input, which biases the network towards that modality. Generalization therefore becomes a significant problem for such models, because the modality that was non-dominant during training may contribute more at inference time. Here, we introduce a novel training setup with a regularizer in the multimodal architecture to resolve this disparity between the modalities. Specifically, our regularizer term makes the feature fusion method more robust by treating both feature extractors as equally important during training when extracting the multimodal distribution, which we refer to as removing the imbalance problem. Furthermore, our concept of decoupling the output stream helps the detection task by mutually sharing spatially sensitive information. Extensive experiments with the proposed method on the KAIST and UTokyo datasets show improvements over the respective state-of-the-art performance.