Recent advances in multimodal foundation models (e.g., CLIP) have excelled at zero-shot generalization. Prompt tuning, which transfers knowledge from foundation models to downstream tasks, has recently attracted significant attention. Existing prompt-tuning methods in cross-modal learning, however, either focus solely on the language branch or model vision-language interaction through a shallow mechanism. In this context, we propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism, which progressively and strongly exchanges the representations of the two branches through a well-connected multi-head attention module. We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and analyze robustness to domain shift as well. Thorough experimental analysis clearly demonstrates the superior few-shot generalization and compelling domain adaptation capacity of DCP. The code can be found at https://github.com/GingL/CMPA.
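To make the CMPA idea concrete, below is a minimal PyTorch sketch of one cross-modal prompt attention block that lets vision and language prompts query each other and exchange representations. This is an illustrative assumption based on the abstract's description, not the authors' exact implementation; the class name, dimensions, and residual update scheme are hypothetical.

```python
# Illustrative sketch of a cross-modal prompt attention block (not the official DCP code).
import torch
import torch.nn as nn


class CrossModalPromptAttention(nn.Module):
    """Exchanges information between vision and language prompts at one layer."""

    def __init__(self, vis_dim: int = 768, txt_dim: int = 512,
                 shared_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Project both prompt sets into a shared space for attention.
        self.vis_in = nn.Linear(vis_dim, shared_dim)
        self.txt_in = nn.Linear(txt_dim, shared_dim)
        # One multi-head attention per direction: vision<-text and text<-vision.
        self.vis_from_txt = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        # Project back to each branch's native width.
        self.vis_out = nn.Linear(shared_dim, vis_dim)
        self.txt_out = nn.Linear(shared_dim, txt_dim)

    def forward(self, vis_prompts: torch.Tensor, txt_prompts: torch.Tensor):
        # vis_prompts: (batch, n_vis_prompts, vis_dim)
        # txt_prompts: (batch, n_txt_prompts, txt_dim)
        v = self.vis_in(vis_prompts)
        t = self.txt_in(txt_prompts)
        # Each branch attends to the other branch's prompts.
        v_upd, _ = self.vis_from_txt(query=v, key=t, value=t)
        t_upd, _ = self.txt_from_vis(query=t, key=v, value=v)
        # Residual update keeps each branch's original prompt content.
        return vis_prompts + self.vis_out(v_upd), txt_prompts + self.txt_out(t_upd)


if __name__ == "__main__":
    block = CrossModalPromptAttention()
    vis = torch.randn(2, 4, 768)   # 4 vision prompts per image
    txt = torch.randn(2, 4, 512)   # 4 language prompts per text template
    new_vis, new_txt = block(vis, txt)
    print(new_vis.shape, new_txt.shape)
```

In this reading, applying such a block at successive transformer layers would couple the two encoders progressively, rather than only fusing at the input or output as shallower prompt-tuning schemes do.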