DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

Class-agnostic object detection (OD) can be a cornerstone or a bottleneck for many downstream vision tasks. Despite considerable advancements in bottom-up and multi-object discovery methods that leverage basic visual cues to identify salient objects, consistently achieving a high recall rate remains difficult due to the diversity of object types and their contextual complexity. In this work, we investigate using vision-language models (VLMs) to enhance object detection via a self-supervised prompt learning strategy. Our initial findings indicate that manually crafted text queries often result in undetected objects, primarily because detection confidence diminishes when the query words exhibit semantic overlap. To address this, we propose a Dispersing Prompt Expansion (DiPEx) approach. DiPEx progressively learns to expand a set of distinct, non-overlapping hyperspherical prompts to enhance recall rates, thereby improving performance in downstream tasks such as out-of-distribution OD. Specifically, DiPEx initiates the process by self-training generic parent prompts and selecting the one with the highest semantic uncertainty for further expansion. The resulting child prompts are expected to inherit semantics from their parent prompts while capturing more fine-grained semantics. We apply dispersion losses to ensure high inter-class discrepancy among child prompts while preserving semantic consistency between parent-child prompt pairs. To prevent excessive growth of the prompt sets, we utilize the maximum angular coverage (MAC) of the semantic space as a criterion for early termination. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS, surpassing other prompting methods by up to 20.1% in AR and achieving a 21.3% AP improvement over SAM. The code is available at https://github.com/jason-lim26/DiPEx.
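The abstract gives no formulas, so the following is only a minimal NumPy sketch of the three quantities it names: a dispersion term that pushes child prompts apart on the unit hypersphere, a parent-child consistency term, and an angular-coverage measure usable as a stopping signal. The function names, the temperature `tau`, the log-mean-exp form of the dispersion term, and the reading of MAC as the largest pairwise angle between prompts are all assumptions made for illustration, not the authors' definitions.

```python
import numpy as np

def normalize(x):
    """Project vectors (rows, or a single 1-D vector) onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def dispersion_loss(children, tau=0.1):
    """Assumed form: log-mean-exp of pairwise cosine similarities between
    distinct child prompts. Lower values mean the children are more spread out."""
    c = normalize(children)
    sim = c @ c.T                              # pairwise cosine similarities
    off_diag = ~np.eye(len(c), dtype=bool)     # ignore self-similarity
    return float(np.log(np.mean(np.exp(sim[off_diag] / tau))))

def consistency_loss(parent, children):
    """Assumed form: mean (1 - cosine similarity) between each child and its parent."""
    c = normalize(children)
    p = normalize(parent)
    return float(np.mean(1.0 - c @ p))

def max_angular_coverage(prompts):
    """Assumed reading of MAC: the largest pairwise angle (in degrees)
    spanned by the current prompt set."""
    p = normalize(prompts)
    cos = np.clip(p @ p.T, -1.0, 1.0)          # guard arccos against round-off
    return float(np.degrees(np.arccos(cos.min())))

# Toy example: one parent prompt and two candidate sets of child prompts.
parent = np.array([1.0, 0.0, 0.0])
tight_children = np.array([[1.0, 0.01, 0.0], [1.0, 0.0, 0.01]])
spread_children = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])

print(dispersion_loss(tight_children), dispersion_loss(spread_children))
print(consistency_loss(parent, tight_children), consistency_loss(parent, spread_children))
print(max_angular_coverage(tight_children), max_angular_coverage(spread_children))
```

In this toy setup the spread children score a lower dispersion loss but a higher consistency loss than the tight children, which is the trade-off the two objectives are meant to balance. Under the same assumptions, an early-termination rule could stop expanding once the angular coverage no longer grows between expansion rounds.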
