Multispectral pedestrian detection has been shown to improve detection performance under complex illumination. However, the double-stream networks prevalent in multispectral detection use two separate feature extraction branches for the multi-modal data, which nearly doubles inference time compared to single-stream networks with a single feature extraction branch. This added inference time has hindered the deployment of multispectral pedestrian detection on embedded devices for autonomous systems. Various knowledge distillation methods have been proposed to address this limitation. However, traditional distillation methods focus only on the fused features and ignore the large amount of information in the original multi-modal features, which restricts the student network's performance. To address this, we introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network. Specifically, a Modal Extraction Alignment (MEA) module, which integrates focal and global attention mechanisms, derives learning weights for the student network. This enables the student network to acquire an optimal fusion strategy independent of the teacher network's, without requiring an additional feature fusion module. Furthermore, we present the SMOD dataset, a challenging, well-aligned multispectral dataset for detection. Extensive experiments on the challenging KAIST, LLVIP and SMOD datasets validate the effectiveness of AMFD. The results demonstrate that our method outperforms existing state-of-the-art methods in both reducing log-average Miss Rate and improving mean Average Precision. The code is available at https://github.com/bigD233/AMFD.git.
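For readers unfamiliar with the log-average Miss Rate mentioned above, the following is a minimal sketch of how the metric is commonly computed in pedestrian detection benchmarks: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space over [10^-2, 10^0]. The function name `log_average_miss_rate` and its arguments are illustrative assumptions and are not taken from the AMFD code base; the actual evaluation scripts may differ in detail.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi and miss_rate describe one detector's curve and must be sorted by
    increasing FPPI (that is, by decreasing score threshold).
    Hypothetical helper for illustration only.
    """
    if len(fppi) != len(miss_rate):
        raise ValueError("fppi and miss_rate must have the same length")

    # Reference points evenly spaced in log space between 1e-2 and 1e0.
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    log_sum = 0.0
    for ref in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference.
        # If the curve starts above the reference, assume everything is missed.
        mr = 1.0
        for x, y in zip(fppi, miss_rate):
            if x <= ref:
                mr = y
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        log_sum += math.log(max(mr, 1e-10))

    return math.exp(log_sum / num_points)
```

A lower value is better: a detector whose curve never reaches the reference FPPI range scores 1.0, while one that holds a constant miss rate of 0.5 across the whole range scores 0.5.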