Multispectral pedestrian detection offers better visibility in challenging conditions and is therefore broadly applicable across tasks in which both accuracy and computational cost are of paramount importance. Most existing approaches treat the RGB and infrared modalities equally, typically adopting two symmetrical CNN backbones for multimodal feature extraction. This ignores the substantial differences between the modalities and makes it difficult both to reduce computational cost and to fuse the modalities effectively. In this work, we propose a novel and efficient framework named WCCNet that differentially extracts rich features of different spectra with lower computational complexity and semantically rearranges these features for effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT), which allows fast inference and training, is embedded to construct a dual-stream backbone for efficient feature extraction. The DWT layers of WCCNet extract frequency components for the infrared modality, while the CNN layers extract spatial-domain features for the RGB modality. This design not only significantly reduces computational complexity, but also improves the extraction of infrared features to facilitate the subsequent crossmodal fusion. Building on these well-extracted features, we design the crossmodal rearranging fusion module (CMRF), which mitigates spatial misalignment and merges semantically complementary features of spatially related local regions to amplify the crossmodal complementary information. We conduct comprehensive evaluations on the KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods, combining considerable computational efficiency with competitive accuracy. We also perform an ablation study and thoroughly analyze the impact of the different components on the performance of WCCNet.
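To illustrate why a DWT layer is inexpensive compared with a learned convolution, the following is a minimal sketch of a single-level 2D DWT applied to a toy infrared patch. It assumes the Haar wavelet, which the abstract does not specify, and the names `haar_dwt2` and `toy_ir` are hypothetical; it is not the WCCNet implementation. The transform has no learnable parameters and halves the spatial resolution while splitting the input into one low-frequency and three high-frequency sub-bands.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT (orthonormal scaling) of a 2D list.

    Assumption: Haar is used purely for illustration. Returns four
    sub-bands (ll, lh, hl, hh), each half the input height and width.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")

    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (low frequency)
            lh_row.append((a - b + c - d) / 2)  # left-right difference
            hl_row.append((a + b - c - d) / 2)  # top-bottom difference
            hh_row.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def energy(band):
    """Sum of squared values of a 2D list."""
    return sum(v * v for row in band for v in row)


# Toy 4x4 "infrared" patch: flat regions on top, a horizontal edge below.
toy_ir = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [2, 2, 2, 2],
    [4, 4, 4, 4],
]
ll, lh, hl, hh = haar_dwt2(toy_ir)
```

In this toy patch only the top-bottom difference band responds to the horizontal edge in the lower half, while the flat regions contribute to the low-frequency band alone. Because the scaling is orthonormal, the total energy of the four sub-bands equals that of the input, so the decomposition loses no information despite the reduced resolution. In a network, the four sub-bands would typically be stacked as channels before further processing; how WCCNet does this is not stated in the abstract.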