Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

Multispectral pedestrian detection aims to detect and locate pedestrians in color and thermal images, and it is widely used in autonomous driving, video surveillance, and related applications. Most existing multispectral pedestrian detection algorithms have achieved only limited success because they do not account for the confusion between pedestrian information and background noise in color and thermal images. We propose a multispectral pedestrian detection algorithm that consists mainly of a cascaded information enhancement module and a cross-modal attention feature fusion module. The cascaded information enhancement module applies channel and spatial attention to the features fused by the cascaded feature fusion block, and then multiplies each single-modal feature element-wise by the resulting attention weights, which enhances the pedestrian features in each modality and suppresses background interference. The cross-modal attention feature fusion module mines the color and thermal features so that the two modalities complement each other, constructs global features by adding the cross-modal complemented features element-wise, and applies attention weighting to these global features to fuse the two modalities effectively. Finally, the fused features are fed into the detection head to detect and locate pedestrians. Extensive experiments were performed on two improved annotation sets (sanitized annotations and paired annotations) of the public KAIST dataset. The results show that our method achieves a lower pedestrian miss rate and more accurate pedestrian detection boxes than the compared methods. An ablation study further confirms the effectiveness of each module designed in this paper.
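To make the data flow of the two modules concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract gives no layer definitions, so everything here is an assumption made for illustration: the function names, the parameter-free attention (mean pooling followed by a sigmoid, in place of learned layers), the use of element-wise addition as a stand-in for the cascaded feature fusion block, the gated form of the cross-modal complement, and the convex combination used for the final weighted fusion.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to map pooled features to weights in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_spatial_attention(feat):
    """Hypothetical parameter-free channel + spatial attention.

    feat has shape (C, H, W). A real module would use learned layers;
    mean pooling plus a sigmoid is only a stand-in.
    """
    # Channel attention: one weight per channel, shape (C, 1, 1).
    channel = sigmoid(feat.mean(axis=(1, 2), keepdims=True))
    # Spatial attention: one weight per location, shape (1, H, W).
    spatial = sigmoid((feat * channel).mean(axis=0, keepdims=True))
    # Broadcast to a full (C, H, W) weight map.
    return channel * spatial


def cascaded_enhancement(color, thermal):
    """Weight each single-modal feature by attention computed on fused features."""
    # Assumption: element-wise addition replaces the cascaded feature fusion block.
    fused = color + thermal
    weight = channel_spatial_attention(fused)
    # Element-wise multiplication of each modality with the shared weights.
    return color * weight, thermal * weight


def cross_modal_fusion(color, thermal):
    """Complement each modality with the other, then fuse with attention weights."""
    # Assumption: a sigmoid-gated residual is used as the cross-modal complement.
    color_c = color + sigmoid(thermal) * thermal
    thermal_c = thermal + sigmoid(color) * color
    # Global features: element-wise sum of the complemented features.
    global_feat = color_c + thermal_c
    weight = channel_spatial_attention(global_feat)
    # Assumption: attention-weighted convex combination of the two modalities.
    return weight * color_c + (1.0 - weight) * thermal_c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    color = rng.standard_normal((4, 8, 8))    # toy color feature map (C, H, W)
    thermal = rng.standard_normal((4, 8, 8))  # toy thermal feature map (C, H, W)
    color_e, thermal_e = cascaded_enhancement(color, thermal)
    fused = cross_modal_fusion(color_e, thermal_e)
    print(fused.shape)
```

In this sketch the fused tensor keeps the input shape, so it could be passed to a detection head that expects a single-modality feature map. Because the attention weights lie in (0, 1), the enhancement step can only attenuate feature magnitudes, which mirrors the stated goal of suppressing background responses.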
