Vision Language Models (VLMs), pre-trained on large-scale image-text datasets, enable zero-shot predictions for unseen data but may underperform on specific unseen tasks. Continual learning (CL) can help VLMs effectively adapt to new data distributions without joint training, but it faces the challenges of catastrophic forgetting and generalization forgetting. Although significant progress has been achieved by distillation-based methods, they exhibit two severe limitations. First, the widely adopted single-teacher paradigm fails to impart comprehensive knowledge. Second, existing methods inadequately leverage the multimodal information in the original training dataset and instead rely on additional data for distillation, which increases computational and storage overhead. To mitigate both limitations, we draw on Knowledge Integration Theory (KIT) and propose a Multi-Stage Knowledge Integration network (MulKI) that emulates the human learning process within a distillation framework. MulKI comprises four stages: Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. Across these stages, we first leverage prototypes to align the two modalities, eliciting cross-modal knowledge; we then add new knowledge by constructing fine-grained intra- and inter-modality relationships with the prototypes. After that, knowledge from two teacher models is adaptively distinguished and re-weighted. Finally, we connect models within and across tasks, integrating preceding and new knowledge. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks, showcasing its potential for adapting VLMs to evolving data distributions.
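As a concrete illustration of the Distinguishing Ideas stage described above, the following is a minimal PyTorch sketch of how knowledge from two teacher models might be adaptively distinguished and re-weighted per sample during distillation. The function name `dual_teacher_kd_loss`, the confidence-based weighting scheme, and the temperature value are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_teacher_kd_loss(student_logits, teacher_a_logits, teacher_b_logits,
                         labels, tau=2.0):
    """Distill from two teachers, re-weighting each teacher per sample by its
    softmax confidence on the ground-truth class (a hypothetical criterion)."""
    # Per-sample confidence of each teacher on the true label, shape (B,).
    conf_a = F.softmax(teacher_a_logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    conf_b = F.softmax(teacher_b_logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    # Normalize the two confidences into per-sample mixture weights, shape (B, 2).
    weights = torch.stack([conf_a, conf_b], dim=1)
    weights = weights / weights.sum(dim=1, keepdim=True)

    # Temperature-scaled KL divergence between student and each teacher.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    kd_a = F.kl_div(log_p_student, F.softmax(teacher_a_logits / tau, dim=-1),
                    reduction='none').sum(-1)  # per-sample loss, shape (B,)
    kd_b = F.kl_div(log_p_student, F.softmax(teacher_b_logits / tau, dim=-1),
                    reduction='none').sum(-1)

    # Adaptive mixture of the two teachers' distillation signals,
    # with the standard tau^2 scaling used in knowledge distillation.
    return (weights[:, 0] * kd_a + weights[:, 1] * kd_b).mean() * tau ** 2

# Usage example with random logits: batch of 4 samples, 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher_a = torch.randn(4, 10)
teacher_b = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = dual_teacher_kd_loss(student, teacher_a, teacher_b, labels)
loss.backward()
```

Here each teacher's per-sample weight is its confidence on the ground-truth class, so the student follows whichever teacher is more reliable on that sample; the actual re-weighting criterion in MulKI may differ.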