In recent years, soft prompt learning methods have been proposed to adapt large-scale vision-language pre-trained models to various downstream tasks. These methods typically combine learnable textual tokens with class tokens and feed them to a model whose parameters are frozen. However, they often use a single prompt to describe class contexts, which fails to capture the diverse attributes of each category. This study introduces Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method partitions the depth of the visual encoder and connects learnable prompts to the separated depth partitions, enabling different prompts to capture the hierarchical contextual information in visual representations at different depths. Furthermore, to maximize the advantages of multi-prompt learning, we combine prior information from manually designed templates with the learnable multi-prompts, improving the generalization capability of our approach. We evaluate our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a harmonic mean of $79.28$ averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), making it competitive with state-of-the-art prompting methods.
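To make the reported metric concrete, the following minimal sketch shows how a harmonic mean of two accuracies is computed. It assumes the harmonic mean is taken over base-class and new-class accuracy, as is usual in the new class generalization setting; the abstract does not state this explicitly. The function name `harmonic_mean` and the example accuracies are hypothetical and are not taken from the paper. The only figures drawn from the abstract are $79.28$ and $+7.62$, which together imply the CoOp baseline.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    Assumption: the two inputs are base-class and new-class accuracy.
    """
    if base_acc <= 0 or new_acc <= 0:
        # The harmonic mean is undefined for non-positive inputs;
        # returning 0.0 is an illustrative convention only.
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies, chosen only for illustration (not from the paper).
example_h = harmonic_mean(80.0, 70.0)

# CoOp baseline implied by the abstract: 79.28 is reported as +7.62 over CoOp.
pmpo_h = 79.28
gain_over_coop = 7.62
implied_coop_h = round(pmpo_h - gain_over_coop, 2)

print(f"example harmonic mean: {example_h:.2f}")
print(f"implied CoOp harmonic mean: {implied_coop_h:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on it by improving base-class accuracy alone while new-class accuracy degrades.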