Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities provide complementary information, especially under low-light conditions. To reduce the influence of hand-designed components in existing multispectral pedestrian detectors, we propose a MultiSpectral pedestrian DEtection TRansformer (MS-DETR), which extends deformable DETR to the multi-modal paradigm. To facilitate multi-modal learning, we first introduce a Reference box Constrained Cross-Attention (RCCA) module into the multi-modal Transformer decoder; it uses the fusion branch together with the reference boxes as intermediaries to enable interaction between the visible and thermal modalities. To further balance the contributions of the different modalities, we design a modality-balanced optimization strategy, which aligns the slots of the decoders by adaptively adjusting the instance-level weights of the three branches. Our end-to-end MS-DETR shows superior performance on the challenging KAIST and CVC-14 benchmark datasets.