Large language models (LLMs) have shown a strong ability to capture contextual and semantic information about the appearance of instances. In this paper, we introduce a novel approach that exploits an LLM's understanding of contextual appearance variations and transfers this knowledge to a vision model (here, pedestrian detection). Although pedestrian detection is one of the crucial tasks directly related to our safety (e.g., in intelligent driving systems), it remains challenging because pedestrians vary widely in appearance and pose across diverse scenes. We therefore propose to formulate language-driven appearance knowledge units and to incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus that contains numerous narratives describing the various appearances of pedestrians and of other instances. By feeding these narratives through an LLM, we extract appearance knowledge sets that contain representations of appearance variations. We then perform a task-prompting process to obtain appearance knowledge units, which are representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. Finally, we enrich the detector with appearance information by integrating the language-driven knowledge units with visual cues. Through comprehensive experiments with various pedestrian detectors, we verify the effectiveness of our method, which yields noticeable performance gains and achieves state-of-the-art detection performance.
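The pipeline outlined in the abstract (description corpus, knowledge sets, task prompting, fusion with visual cues) can be illustrated with a minimal sketch. The abstract does not specify how any stage is implemented, so everything below is an assumption made for illustration: `embed_text` is a hypothetical hash-based stand-in for an LLM encoder, task prompting is approximated by similarity-weighted pooling, and fusion is approximated by adding an attention-weighted mix of knowledge units to a visual feature. It is not the authors' method.

```python
import hashlib
import math

DIM = 8  # toy embedding size (assumption, chosen only for illustration)

def embed_text(text, dim=DIM):
    """Hypothetical stand-in for an LLM encoder: deterministic hash-based unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [digest[i] / 255.0 - 0.5 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_unit(knowledge_set, task_vec):
    """Pool a knowledge set into one unit, weighting entries by similarity
    to the task prompt (an assumed approximation of task prompting)."""
    weights = softmax([dot(k, task_vec) for k in knowledge_set])
    return [sum(w * k[i] for w, k in zip(weights, knowledge_set))
            for i in range(len(task_vec))]

def fuse(visual_feature, units):
    """Add an attention-weighted mix of knowledge units to a visual feature
    (an assumed fusion scheme, not taken from the paper)."""
    weights = softmax([dot(visual_feature, u) for u in units])
    mixed = [sum(w * u[i] for w, u in zip(weights, units))
             for i in range(len(visual_feature))]
    return [v + m for v, m in zip(visual_feature, mixed)]

# 1. Description corpus: narratives about pedestrians and other instances.
corpus = {
    "pedestrian": ["a person walking with an umbrella",
                   "a child crossing the street"],
    "other": ["a traffic pole beside the road",
              "a mannequin in a shop window"],
}

# 2. Appearance knowledge sets: one embedding per narrative.
knowledge_sets = {label: [embed_text(d) for d in descs]
                  for label, descs in corpus.items()}

# 3. Task prompting: condense each set into a task-relevant knowledge unit.
task_vec = embed_text("detect pedestrians in street scenes")
units = [knowledge_unit(ks, task_vec) for ks in knowledge_sets.values()]

# 4. Fusion: combine the knowledge units with a (placeholder) visual feature.
visual_feature = embed_text("placeholder for a detector region feature")
fused = fuse(visual_feature, units)
```

In a real system the hash-based encoder would be replaced by hidden states from an actual LLM, and the visual feature would come from the detector backbone; the sketch only shows how the four stages connect.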