Multimodal foundation models such as CLIP have shown strong zero-shot generalization, and prompt tuning has recently gained significant attention as a way to transfer their knowledge to downstream tasks. Existing prompt-tuning methods in cross-modal learning, however, either focus solely on the language branch or learn vision-language interaction only through a shallow mechanism. We therefore propose Deeply coupled Cross-modal Prompt learning (DCP), a method built on CLIP. DCP accommodates the interplay between vision and language through a Cross-Modal Prompt Attention (CMPA) mechanism, in which a multi-head attention module connecting the two branches lets them exchange their respective representations progressively. We conduct comprehensive few-shot learning experiments on 11 image classification datasets and also analyze robustness to domain shift. The experimental analysis demonstrates the strong few-shot generalization and domain adaptation capacity of DCP. The code is available at \href{https://github.com/GingL/CMPA}{https://github.com/GingL/CMPA}.
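To make the idea of cross-modal prompt attention more concrete, the following is a minimal NumPy sketch of one exchange step, in which the prompts of one branch act as queries over the prompts of the other branch. It is an illustration under stated assumptions, not the authors' implementation: the class name `CrossModalPromptAttention`, the random weight initialization, the residual connection, the number of prompts (4), the head configuration, and the prompt widths (512 for text, 768 for vision) are all hypothetical choices made for this example. A deeply coupled variant would presumably repeat such a step at several encoder layers; the sketch shows a single layer only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossModalPromptAttention:
    """Hypothetical sketch: prompts of one modality attend to the other's prompts."""

    def __init__(self, query_dim, context_dim, num_heads=4, head_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        inner = num_heads * head_dim
        self.num_heads = num_heads
        self.head_dim = head_dim
        # Randomly initialized projections (assumption; these would be learned).
        self.w_q = rng.normal(0.0, 0.02, (query_dim, inner))
        self.w_k = rng.normal(0.0, 0.02, (context_dim, inner))
        self.w_v = rng.normal(0.0, 0.02, (context_dim, inner))
        self.w_o = rng.normal(0.0, 0.02, (inner, query_dim))

    def _split_heads(self, x):
        # (n, heads * head_dim) -> (heads, n, head_dim)
        n = x.shape[0]
        return x.reshape(n, self.num_heads, self.head_dim).transpose(1, 0, 2)

    def __call__(self, query_prompts, context_prompts):
        q = self._split_heads(query_prompts @ self.w_q)
        k = self._split_heads(context_prompts @ self.w_k)
        v = self._split_heads(context_prompts @ self.w_v)
        # Scaled dot-product attention per head: (heads, n_query, n_context).
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.head_dim)
        weights = softmax(scores, axis=-1)
        mixed = weights @ v  # (heads, n_query, head_dim)
        # Merge heads back to (n_query, heads * head_dim).
        merged = mixed.transpose(1, 0, 2).reshape(query_prompts.shape[0], -1)
        # Residual update keeps each prompt in its own modality's width.
        return query_prompts + merged @ self.w_o, weights


# Illustrative prompt tensors (random placeholders, not real CLIP prompts).
rng = np.random.default_rng(1)
text_prompts = rng.normal(size=(4, 512))
visual_prompts = rng.normal(size=(4, 768))

text_from_vision = CrossModalPromptAttention(query_dim=512, context_dim=768)
vision_from_text = CrossModalPromptAttention(query_dim=768, context_dim=512, seed=2)

# Mutual exchange: each branch updates its prompts using the other branch.
new_text, text_attn = text_from_vision(text_prompts, visual_prompts)
new_visual, visual_attn = vision_from_text(visual_prompts, text_prompts)
```

Each branch keeps its own prompt width, so the updated prompts could be fed back into the corresponding encoder without further reshaping; whether DCP uses separate or shared attention weights for the two directions is not stated in the abstract, and the two separate modules here are an assumption.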