This paper studies continual learning (CL) of vision-language models (VLMs) in open domains, where a model must be continually updated and perform inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they too focus on closed-set CL within a single-domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of the zero-shot knowledge in the pre-trained VLMs, in addition to the forgetting of knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by jointly learning a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.