Text prompts are crucial for generalizing pre-trained open-set object detection models to new categories. However, existing text-prompt methods require manual feedback when generalizing to new categories, which limits their ability to model complex scenes and often leads to incorrect detections. To address this limitation, we propose a novel visual prompt method that learns new-category knowledge from a few labeled images and thereby generalizes the pre-trained detection model to the new category. So that visual prompts can represent new categories adequately, we propose a statistics-based prompt construction module that is not constrained by a predefined vocabulary length, allowing more vectors to be used to represent each category. We further use the category dictionaries of the pre-training dataset to design task-specific similarity dictionaries, which make the visual prompts more discriminative. We evaluate the method on the ODinW dataset and show that it outperforms existing prompt learning methods and performs more consistently in combinatorial inference.