Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks. Existing research has explored many characteristics of V-L models, including their sensitivity to text input and the difficulty of tuning prompts across multiple modalities. Building on V-L models such as CLIP, recent approaches use learnable prompts instead of hand-crafted prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we note that adapting the different modality branches of CLIP through a sequential training process efficiently improves generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, and achieves generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, which is an advantage when adopting V-L models.