Multispectral pedestrian detection achieves better visibility in challenging conditions and is thus essential to autonomous driving, where both accuracy and computational cost are of paramount importance. Most existing approaches treat the RGB and infrared modalities equally, typically adopting two symmetric backbones for multimodal feature extraction. This ignores the substantial differences between the modalities, making it difficult to reduce computational cost and hindering effective crossmodal fusion. In this work, we propose a novel and efficient framework named the Wavelet-context Cooperative Network (WCCNet), which differentially extracts complementary features of different spectra with lower computational complexity and then fuses these diverse features based on their spatially relevant crossmodal semantics. In particular, WCCNet simultaneously explores wavelet context and RGB textures within a cooperative dual-stream backbone composed of adaptive discrete wavelet transform (ADWT) layers and heavyweight neural layers. The ADWT layers extract frequency components from the infrared modality, while the neural layers extract features from the RGB modality. Since the ADWT layers are lightweight and extract complementary features, this cooperative structure not only significantly reduces computational complexity but also facilitates the subsequent crossmodal fusion. To further fuse the infrared and RGB features, which exhibit significant semantic differences, we carefully design a crossmodal rearranging fusion (CMRF) module that mitigates spatial misalignment and merges semantically complementary features within spatially related local regions, amplifying the crossmodal reciprocal information. Experimental results on the KAIST and FLIR benchmarks show that WCCNet outperforms state-of-the-art methods in efficiency while achieving competitive accuracy.
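To make concrete why a wavelet branch is so much cheaper than a convolutional one, the sketch below implements a single-level 2D Haar DWT in plain NumPy: it decomposes an infrared frame into four half-resolution frequency subbands (LL, LH, HL, HH) using only additions and subtractions. This is a minimal illustration of the kind of frequency decomposition the ADWT layers perform, not the adaptive, learned transform of WCCNet itself; the function name and the Haar basis are our own assumptions for demonstration.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """Single-level 2D Haar DWT.

    Splits a 2D array (e.g. an infrared frame) into four subbands,
    each at half the spatial resolution:
      LL - low-frequency approximation (coarse structure)
      LH/HL - horizontal/vertical detail (edges)
      HH - diagonal detail
    Only adds/subtracts over pixel pairs: far cheaper than convolution.
    """
    # Low-pass and high-pass along the width (pairwise averages/differences).
    a = (x[:, 0::2] + x[:, 1::2]) / 2.0
    d = (x[:, 0::2] - x[:, 1::2]) / 2.0
    # Repeat along the height to obtain the four subbands.
    ll = (a[0::2, :] + a[1::2, :]) / 2.0
    hl = (a[0::2, :] - a[1::2, :]) / 2.0
    lh = (d[0::2, :] + d[1::2, :]) / 2.0
    hh = (d[0::2, :] - d[1::2, :]) / 2.0
    return ll, lh, hl, hh

# Stand-in for a single-channel infrared frame.
ir = np.random.rand(64, 64)
ll, lh, hl, hh = haar_dwt2(ir)
# Each subband has shape (32, 32); stacking them along the channel
# axis yields a 4-channel feature map analogous to a strided layer's output.
```

Because the transform is invertible and parameter-free, stacking such layers extracts multi-scale frequency context from the infrared stream at negligible cost, leaving the heavyweight neural layers free to focus on RGB textures.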