Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels via image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL), which uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
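To make the CBL idea concrete, the sketch below implements an InfoNCE-style contrastive loss in which each pseudo-labeled region feature is attracted to its class text embedding (positive) and repelled from pre-computed background features (negatives). This is a minimal NumPy illustration under our own assumptions about tensor shapes and the temperature value; the function name `cbl_loss` and all arguments are hypothetical, not the paper's released implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length for cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cbl_loss(region_feats, text_embeds, bg_feats, tau=0.07):
    """Illustrative contrastive background learning (CBL) loss.

    region_feats: (N, D) features of pseudo-labeled object regions
    text_embeds:  (N, D) matching class text embeddings (positives)
    bg_feats:     (M, D) pre-computed background cues (negatives)
    tau:          softmax temperature (assumed value)
    """
    r = l2_normalize(np.asarray(region_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(text_embeds, dtype=np.float64))
    b = l2_normalize(np.asarray(bg_feats, dtype=np.float64))

    pos = np.sum(r * t, axis=1) / tau        # (N,) positive logits
    neg = (r @ b.T) / tau                    # (N, M) background logits

    # InfoNCE: -log p(positive) over [positive | background negatives]
    logits = np.concatenate([pos[:, None], neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    loss = -np.log(exp[:, 0] / exp.sum(axis=1))
    return loss.mean()
```

Pushing region features away from background cues in this way is what encourages the object/background feature disentanglement described above: regions that drift toward background statistics incur a higher loss.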