In recent years, soft prompt learning methods have been proposed to adapt large-scale vision-language pre-trained models to various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input to models whose parameters are kept frozen. However, they often use a single prompt to describe class contexts, which fails to adequately capture the diverse attributes of a category. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method partitions the depth of the visual encoder and connects each learnable prompt to a separate partition, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we combine prior information from manually designed templates with the learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a harmonic mean of $79.28$, averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), making it highly competitive with state-of-the-art prompting methods.
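The depth partitioning and the harmonic-mean metric mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `partition_depths` and `harmonic_mean`, the choice of contiguous near-equal blocks, and the 12-layer encoder size are all assumptions made for illustration, and the accuracy values are placeholders rather than reported results. The harmonic mean is assumed to be taken over base-class and new-class accuracy, as is common in base-to-new generalization benchmarks.

```python
def partition_depths(num_layers: int, num_prompts: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks.

    Assumption: one block of visual-encoder layers per learnable prompt.
    The abstract does not specify how the depths are divided.
    """
    if not 1 <= num_prompts <= num_layers:
        raise ValueError("num_prompts must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_prompts)
    blocks = []
    start = 0
    for i in range(num_prompts):
        # The first `extra` blocks take one additional layer each.
        size = base + (1 if i < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies (assumed: base-class and new-class)."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)


if __name__ == "__main__":
    # Hypothetical 12-layer visual encoder shared among 4 prompts.
    for prompt_id, block in enumerate(partition_depths(12, 4)):
        print(f"prompt {prompt_id} -> layers {list(block)}")
    # Placeholder accuracies, not results from the paper.
    print(harmonic_mean(80.0, 70.0))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method scores well on it only if it does not trade new-class accuracy for base-class accuracy.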