Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection

Large language models (LLMs) have demonstrated the ability to capture contextual and semantic information about the appearance of instances. In this paper, we introduce a novel approach that exploits this strength of LLMs in understanding contextual appearance variations and transfers the knowledge to a vision model (here, pedestrian detection). Although pedestrian detection is a crucial task directly related to safety (e.g., in intelligent driving systems), it remains challenging because pedestrians vary widely in appearance and pose across diverse scenes. We therefore propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus containing numerous narratives that describe various appearances of pedestrians and other instances. By feeding these narratives through an LLM, we extract appearance knowledge sets that contain representations of appearance variations. Subsequently, we perform a task-prompting process to obtain appearance elements, i.e., representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. The obtained appearance elements are adaptable to various detection frameworks, so rich appearance information can be provided by integrating them with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).
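
To make the integration step concrete, the following is a purely illustrative sketch and not the implementation evaluated in this paper. It assumes that the language-derived appearance elements are available as fixed-length vectors and that they are combined with per-location visual features through a residual cross-attention, which is one plausible fusion mechanism among several. The function name `fuse`, the dimensions, and the random stand-in vectors are all hypothetical; a real system would use LLM-derived embeddings, detector features, and learned projection layers, which are omitted here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(visual_feats, appearance_elems):
    """Residual cross-attention (assumed fusion, no learned projections).

    Each visual feature attends over all appearance elements, and the
    attention-weighted mix of elements is added back to the visual feature.
    """
    dim = len(appearance_elems[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for v in visual_feats:
        # Similarity of this visual feature to every appearance element.
        attn = softmax([dot(v, e) * scale for e in appearance_elems])
        # Weighted combination of appearance elements.
        context = [
            sum(a * e[j] for a, e in zip(attn, appearance_elems))
            for j in range(dim)
        ]
        # Residual connection keeps the original visual cue.
        fused.append([vi + ci for vi, ci in zip(v, context)])
        weights.append(attn)
    return fused, weights


# Hypothetical sizes and random stand-ins for real embeddings.
DIM, NUM_ELEMENTS, NUM_LOCATIONS = 8, 4, 6
rng = random.Random(0)
appearance_elems = [
    [rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_ELEMENTS)
]
visual_feats = [
    [rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LOCATIONS)
]

fused, weights = fuse(visual_feats, appearance_elems)
print(len(fused), len(fused[0]))
```

Because the fused features keep the shape of the visual features, a fusion of this kind could in principle be placed in front of an existing detection head without changing it, which is consistent with the claim that the appearance elements are adaptable to various detection frameworks.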
