Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on vast numbers of image-text pairs. To make these methods more effective, researchers have used large-vocabulary datasets covering many object classes, on the assumption that such data lets models extract comprehensive knowledge about the relationships between objects and generalize better to unseen classes. In this study, we argue that object names alone are not enough: more fine-grained labels, including object attributes and relationships, are needed to extract richer knowledge about novel objects. To this end, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which uses an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels provide dense samples for knowledge distillation. On the LVIS benchmark, our best model, trained on the de-duplicated VisualGenome dataset, achieves an AP of 34.5 and an APr of 30.6, comparable to state-of-the-art performance. PCL is also simple and flexible: it is a straightforward pre-processing technique that can be used with any image captioning model and imposes no restrictions on model architecture or training process.
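As a rough illustration of the pre-processing idea described above, the sketch below attaches several captions to each annotated object instance. It is not the authors' implementation: the names `pseudo_caption_labels`, `crop`, and `toy_captioner` are hypothetical, the captioner interface (region, object name, number of captions) is an assumption, and the toy captioner is a stand-in for a real image captioning model.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels; assumed format
Image = List[List[int]]           # toy image: rows of pixel values

# Assumed captioner interface: (region, object_name, n) -> list of captions.
Captioner = Callable[[Image, str, int], List[str]]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box (end coordinates exclusive)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def pseudo_caption_labels(
    image: Image,
    annotations: List[Dict],
    captioner: Captioner,
    captions_per_box: int = 3,
) -> List[Dict]:
    """Attach a list of captions to every object annotation.

    Each annotation is assumed to be a dict with "box" and "name" keys.
    """
    labeled = []
    for ann in annotations:
        region = crop(image, ann["box"])
        captions = captioner(region, ann["name"], captions_per_box)
        # Drop repeated captions while keeping their original order.
        unique = list(dict.fromkeys(captions))
        labeled.append({**ann, "captions": unique})
    return labeled


def toy_captioner(region: Image, name: str, n: int) -> List[str]:
    """Hypothetical stand-in for a captioning model; uses fixed templates."""
    height = len(region)
    width = len(region[0]) if region else 0
    templates = [
        f"a {name}",
        f"a {name} about {width} pixels wide",
        f"a {name} about {height} pixels tall",
    ]
    return templates[:n]
```

Because the captioner is passed in as a plain callable, any captioning model could in principle be wrapped behind the same interface, which mirrors the claim that the technique does not constrain the detector's architecture or training process.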