Large vision-language models (VLMs) have shown significant performance gains across various application domains. However, adapting them to a sequence of tasks is challenging: finetuning a VLM on each task typically reduces its generalization power and its capacity to learn new tasks, and causes catastrophic forgetting of previously learned tasks. Enabling VLMs to operate in multimodal continual learning (CL) settings helps address such scenarios. To improve generalization capacity and prevent catastrophic forgetting, we propose a novel prompt-based CL method for VLMs, namely $\textbf{Clu}$ster-based $\textbf{Mo}$dality Fusion Prompt (\textbf{CluMo}). We design a novel \textbf{Key-Key-Prompt} pair, where each prompt is associated with a visual prompt key and a textual prompt key. We adopt a two-stage training strategy: in the first stage, the single-modal keys are trained via the $K$-means clustering algorithm to help select the best semantically matched prompt; in the second stage, the prompt keys are frozen and the selected prompt is attached to the input while the VLM is trained in the CL scenario. Experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance.
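The two-stage selection described above can be sketched in miniature. The snippet below is a minimal illustration, not the paper's implementation: `kmeans` is a toy stand-in for the first-stage training of single-modal prompt keys, and `select_prompt` is a hypothetical helper showing how a frozen visual key and textual key could jointly index a fusion prompt at the second stage. Feature dimensions and cluster counts are arbitrary assumptions.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    # Toy stand-in for stage 1: cluster single-modal features and use the
    # resulting centroids as frozen prompt keys (visual or textual).
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest centroid (squared L2 distance)
        assign = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(0)
    return centers

def select_prompt(v_feat, t_feat, v_keys, t_keys):
    # Hypothetical stage-2 lookup: the nearest visual key and nearest textual
    # key together index one fusion prompt in a (visual x textual) prompt pool.
    i = int(np.argmin(((v_keys - v_feat) ** 2).sum(-1)))
    j = int(np.argmin(((t_keys - t_feat) ** 2).sum(-1)))
    return i, j  # index of the selected fusion prompt in the pool
```

In this sketch the selected `(i, j)` pair would address one prompt in a two-dimensional prompt pool, which is then prepended to the VLM input; the actual key training and fusion details follow the method described in the paper.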