Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection

Large language models (LLMs) have demonstrated an ability to capture contextual and semantic knowledge about how object instances appear. In this paper, we introduce a novel approach that exploits this understanding of contextual appearance variations and transfers the knowledge to a vision model (here, a pedestrian detector). Pedestrian detection is a crucial task directly related to human safety (e.g., in intelligent driving systems), yet it remains challenging because pedestrians vary widely in appearance and pose across diverse scenes. We therefore propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus comprising numerous narratives that describe various appearances of pedestrians and other instances. By feeding these narratives through an LLM, we extract appearance knowledge sets that contain representations of appearance variations. We then perform a task-prompting process to obtain appearance elements, i.e., representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. The obtained appearance elements are adaptable to various detection frameworks, so we can supply rich appearance information by integrating them with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).
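The integration step described above can be illustrated with a minimal sketch. The abstract does not specify how the appearance elements are combined with visual cues, so this example assumes one plausible mechanism: each visual feature attends over the appearance elements with scaled dot-product attention, and the attended context is added back residually. The function names (`softmax`, `dot`, `fuse`) and the toy vectors are hypothetical. In practice the elements would come from the LLM and task-prompting stages rather than being written by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(visual_feats, appearance_elements):
    """Assumed fusion: attend from each visual feature to the appearance
    elements, then add the attended context to the feature (residual)."""
    dim = len(appearance_elements[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for feat in visual_feats:
        # Similarity of this visual feature to every appearance element.
        weights = softmax([dot(feat, elem) * scale for elem in appearance_elements])
        # Weighted sum of appearance elements (the language-derived context).
        context = [
            sum(w * elem[d] for w, elem in zip(weights, appearance_elements))
            for d in range(dim)
        ]
        fused.append([f + c for f, c in zip(feat, context)])
    return fused


# Toy stand-ins: two 2-D "appearance elements" and two visual features.
elements = [[1.0, 0.0], [0.0, 1.0]]
feats = [[2.0, 0.0], [0.0, 0.0]]
fused = fuse(feats, elements)
```

In this sketch, a feature aligned with the first element draws most of its context from that element, while an all-zero feature receives the uniform average of the elements. Because the fusion acts only on feature vectors, the same step could in principle be placed in different detectors, which is the kind of adaptability the abstract claims.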
