Continual learning in vision-language models (VLMs) is commonly addressed through sequential fine-tuning. Although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of the most recently learned environments at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have extensively studied continual learning and catastrophic forgetting in VLMs, the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective for understanding cross-modal (vision-language) contributions across consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting the robustness of their cross-modal contributions to varying task orders and inter-task similarities, as well as their improved generalization performance.