Multispectral object detection, which uses both visible (RGB) and thermal infrared (T) modalities, has garnered significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between the RGB and thermal modalities while maintaining efficiency remains a critical challenge. In this paper, a simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. The extracted multi-modal features are then directly integrated with a multi-level path aggregation neck, which significantly improves both the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modalities. This kind of supervision is insufficient and unfair, since objects observed in one modality may not be visible in the other. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate that the proposed model achieves state-of-the-art accuracy while maintaining competitive efficiency.
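The annotation issue raised in the abstract can be illustrated with a small sketch. It is not the paper's MS formulation, which the abstract does not specify; it only shows how a union label set differs from per-modality label sets. The `Annotation` class, its `visible_rgb` and `visible_thermal` flags, and both helper functions are hypothetical names introduced for illustration, and the boxes are made-up values.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Annotation:
    """One labeled object; the visibility flags are hypothetical."""
    box: Box
    visible_rgb: bool
    visible_thermal: bool


def union_targets(annotations: List[Annotation]) -> List[Box]:
    """Union annotation: keep an object if either modality sees it."""
    return [a.box for a in annotations if a.visible_rgb or a.visible_thermal]


def modality_targets(annotations: List[Annotation]) -> Tuple[List[Box], List[Box]]:
    """Per-modality targets: each branch is supervised only with
    the objects that are actually visible in that modality."""
    rgb = [a.box for a in annotations if a.visible_rgb]
    thermal = [a.box for a in annotations if a.visible_thermal]
    return rgb, thermal


# Made-up example scene with three objects.
annotations = [
    Annotation((10, 20, 50, 120), visible_rgb=True, visible_thermal=True),
    Annotation((200, 40, 240, 140), visible_rgb=False, visible_thermal=True),
    Annotation((300, 60, 340, 150), visible_rgb=True, visible_thermal=False),
]

fused = union_targets(annotations)
rgb, thermal = modality_targets(annotations)
print(len(fused), len(rgb), len(thermal))
```

Under these assumptions, supervising the RGB branch with the union set would penalize it for missing the second object, which it cannot observe. Separate target lists avoid that mismatch, while the union set can still supervise a fused prediction.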