Multispectral pedestrian detection is important for many around-the-clock applications because the visible and thermal modalities provide complementary information, especially under low-light conditions. Most existing multispectral pedestrian detectors build on non-end-to-end detectors. In this paper, we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR), an end-to-end multispectral pedestrian detector that extends DETR to multi-modal detection. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder in which the visible and thermal features are fused. To tolerate misalignment between the multi-modal images, we design a loosely coupled fusion strategy that sparsely samples keypoints from the multi-modal features independently and fuses them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities but also different pedestrian instances tend to contribute with different confidence to the final detection, we further propose an instance-aware modality-balanced optimization strategy, which preserves the visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR .
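The loosely coupled fusion described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`bilinear_sample`, `softmax`, `loosely_coupled_fusion`), the nested-list feature layout, and the choice to pass offsets and attention logits in as arguments are all assumptions made for illustration. In a trained model those offsets and logits would presumably be predicted from the decoder query; here they are plain inputs. The point it tries to show is that each modality is sampled at its own keypoints, so the two feature maps need not be pixel-aligned, and that one set of normalized weights decides how much each sampled point contributes.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a C-dim feature at continuous (x, y) from an H x W x C nested list."""
    h, w = len(feat), len(feat[0])
    # Clamp so that offsets pointing outside the map stay on its border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][c] + ax * feat[y0][x1][c])
        + ay * ((1 - ax) * feat[y1][x0][c] + ax * feat[y1][x1][c])
        for c in range(len(feat[0][0]))
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loosely_coupled_fusion(vis_feat, thr_feat, ref_xy, vis_offsets, thr_offsets, logits):
    """Fuse keypoints sampled independently from visible and thermal features.

    Assumption (hypothetical interface): `logits` holds one attention logit per
    sampled keypoint, visible keypoints first, then thermal keypoints.
    """
    rx, ry = ref_xy
    # Each modality uses its own offsets, so sampling locations are decoupled.
    samples = [bilinear_sample(vis_feat, rx + dx, ry + dy) for dx, dy in vis_offsets]
    samples += [bilinear_sample(thr_feat, rx + dx, ry + dy) for dx, dy in thr_offsets]
    # Weights are normalized jointly across both modalities.
    weights = softmax(logits)
    channels = len(samples[0])
    fused = [sum(w * s[c] for w, s in zip(weights, samples)) for c in range(channels)]
    return fused, weights


if __name__ == "__main__":
    # Toy 3 x 3 single-channel maps: visible is all 1.0, thermal is all 3.0.
    vis = [[[1.0] for _ in range(3)] for _ in range(3)]
    thr = [[[3.0] for _ in range(3)] for _ in range(3)]
    fused, weights = loosely_coupled_fusion(
        vis, thr, (1.0, 1.0), [(0.5, -0.5)], [(-0.25, 0.25)], [0.0, 0.0]
    )
    print(fused, weights)
```

With equal logits the two keypoints are averaged; raising the thermal logit shifts the fused feature toward the thermal sample, which is one way a model could lean on the thermal modality when the visible image is uninformative.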