Multimodal learning, particularly for pedestrian detection, has recently received attention because it can perform well across several critical autonomous driving scenarios, such as low-light, night-time, and adverse weather conditions. However, in most cases the training distribution emphasizes the contribution of one specific input, which biases the network towards that modality. Generalization therefore becomes a significant problem when the modality that was non-dominant during training contributes more at inference time. Here, we introduce a novel training setup with a regularizer in the multimodal architecture to resolve this disparity between the modalities. Specifically, our regularizer term makes the feature fusion method more robust by treating both feature extractors as equally important during training when extracting the multimodal distribution; we refer to this as removing the imbalance problem. Furthermore, our concept of decoupling the output stream helps the detection task by mutually sharing spatially sensitive information. Extensive experiments on the KAIST and UTokyo datasets show that the proposed method improves on the respective state-of-the-art performance.