Prompt tuning methods such as CoOp have recently shown promising visual recognition and transfer learning ability on a variety of downstream tasks, following the emergence of large pre-trained vision-language models such as CLIP. However, we find that existing uni-modal prompt tuning approaches may yield sub-optimal performance, because tuning prompts in only one modality breaks the original alignment between textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim for completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning (MuDPT), which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network that enables deep, hierarchical, bi-directional prompt fusion. We evaluate MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at https://github.com/Mechrev0/MuDPT.