Multispectral pedestrian detection is attractive for around-the-clock applications because the RGB and thermal modalities carry complementary information. However, current models often fail to detect pedestrians even in obvious cases, especially because of the modality bias learned from statistically biased datasets. These failures suggest that understanding the complementary information itself may be difficult for vision-only models. Accordingly, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework, which incorporates Large Language Models (LLMs) to understand the complementary information at the semantic level and to further enhance the fusion process. Specifically, we generate text descriptions of the pedestrian in each of the RGB and thermal modalities and design Multispectral Chain-of-Thought (MSCoT) prompting, which models a step-by-step process to facilitate cross-modal reasoning at the semantic level and perform accurate detection. Moreover, we design a Language-driven Multi-modal Fusion (LMF) strategy that fuses vision-driven and language-driven detections. Extensive experiments validate that MSCoTDet improves multispectral pedestrian detection.