This work proposes POMP, a prompt pre-training method for vision-language models. Because it is memory- and computation-efficient, POMP lets the learned prompt condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt transfers well and can be plugged directly into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).
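To make the idea of plugging one shared prompt into zero-shot classification concrete, the following is a minimal sketch and not POMP's actual implementation. It assumes a CLIP-style setup in which an image feature is compared by cosine similarity against one text feature per class. The function `toy_text_encoder`, the three-dimensional vectors, and the class names are all hypothetical stand-ins chosen for illustration; a real system would use a pre-trained text encoder and learned prompt vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_text_encoder(prompt_vectors, class_vector):
    """Hypothetical stand-in for a real text encoder (an assumption, not POMP's).

    It averages the shared prompt vectors together with the class-name vector.
    """
    tokens = prompt_vectors + [class_vector]
    dim = len(class_vector)
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def zero_shot_classify(image_feature, prompt_vectors, class_vectors):
    """Score every class against the image using the same shared prompt."""
    image_feature = l2_normalize(image_feature)
    scores = {}
    for name, vec in class_vectors.items():
        text_feature = l2_normalize(toy_text_encoder(prompt_vectors, vec))
        scores[name] = sum(a * b for a, b in zip(image_feature, text_feature))
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: one shared "pre-trained" prompt reused for every class name.
prompt_vectors = [[0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]
class_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_feature = [0.9, 0.2, 0.1]

predicted, scores = zero_shot_classify(image_feature, prompt_vectors, class_vectors)
print(predicted)
```

The point of the sketch is that the prompt vectors are fixed and shared across all class names, so supporting a new class only requires a new class-name vector and no further training.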