Large-scale multi-modal contrastive learning frameworks such as CLIP typically require large numbers of image-text samples for training. In real scenarios, however, these samples are usually collected continuously. This paper discusses the feasibility of continual CLIP training on streaming data. Unlike continual learning based on self-supervised methods for pure images, which is empirically robust against catastrophic forgetting, continual CLIP training suffers significant, non-negligible performance degradation. By analyzing the changes in the model's representation space during continual CLIP training from a spatial geometry perspective, we characterize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we demonstrate empirically and theoretically how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework, Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X improves the capability of the multi-modal model by maintaining the alignment of the multi-modal representation space on the old data domain while continuously fitting the new training data domain. Experiments on commonly used datasets of different scales and scopes demonstrate the effectiveness of our method.
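The idea of aligning contrastive matrices between an old and a new model can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the use of a row-wise KL divergence, and in particular the selection rule (trusting an old-model row only when it ranks the matched pair first) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def softmax_rows(logits):
    """Numerically stable softmax applied independently to each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def contrastive_matrix(image_emb, text_emb, temperature=0.07):
    """Image-to-text similarity distribution for one batch.

    Entry (i, j) is the probability that image i matches text j. The diagonal
    holds the matched pairs; the off-diagonal entries describe how each image
    relates to the other texts in the batch.
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return softmax_rows(sims / temperature)


def off_diagonal_alignment_loss(old_matrix, new_matrix, eps=1e-12):
    """Mean row-wise KL divergence from the old distribution to the new one.

    Assumed selection rule (hypothetical): a row of the old matrix is used as
    the target only if the old model ranks the matched pair first. Otherwise
    the new model's own row is the target, which contributes zero loss.
    """
    n = old_matrix.shape[0]
    old_is_correct = old_matrix.argmax(axis=1) == np.arange(n)
    target = np.where(old_is_correct[:, None], old_matrix, new_matrix)
    kl = np.sum(target * (np.log(target + eps) - np.log(new_matrix + eps)), axis=1)
    return kl.mean()
```

In a continual training step, such a term would be added to the usual contrastive loss on the new data, with `old_matrix` computed by the frozen previous model and `new_matrix` by the model being trained on the same batch. Because each row is a full distribution, matching rows constrains the off-diagonal similarities as well as the diagonal ones.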