Large-scale multi-modal contrastive learning frameworks such as CLIP typically require large numbers of image-text pairs for training. In real-world scenarios, however, these samples are usually collected continuously rather than all at once. This paper examines the feasibility of continual CLIP training on streaming data. Unlike continual learning based on self-supervised methods for pure images, which is empirically robust against catastrophic forgetting, CLIP suffers significant, non-negligible performance degradation in the continual setting. By analyzing, from a spatial geometry perspective, how the model's representation space changes during continual CLIP training, we summarize these changes as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we demonstrate empirically and theoretically how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework, Mod-X (Maintain off-diagonal information-matriX). By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X improves the capability of the multi-modal model: it maintains the alignment of the multi-modal representation space on the old data domain while continuously fitting the new training data domain. Experiments on commonly used datasets of different scales and scopes demonstrate the effectiveness of our method.
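The idea of aligning the off-diagonal information distribution of contrastive matrices can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the temperature value, the use of a row-wise KL divergence, and the selection rule (only rows where the old model ranks the matching pair highest serve as targets) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def contrastive_matrix(img, txt, temperature=0.07):
    """Row-wise softmax over image-to-text cosine similarities.

    Entry (i, j) is how strongly image i is matched to text j.
    The diagonal holds the true pairs; the off-diagonal entries
    describe how each image relates to the other texts in the batch.
    The temperature is an assumed value, not taken from the abstract.
    """
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    rows = []
    for i in img:
        logits = [sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def off_diagonal_alignment_loss(old, new, eps=1e-12):
    """Mean KL(old_row || new_row) over selected rows.

    Assumed selection rule: a row of the old model's matrix is used
    as a target only if its largest entry lies on the diagonal, i.e.
    the old model retrieves the matching text for that image.
    """
    total, count = 0.0, 0
    for i, (p_row, q_row) in enumerate(zip(old, new)):
        if max(range(len(p_row)), key=p_row.__getitem__) != i:
            continue  # old model is wrong here, so do not distill this row
        total += sum(
            p * math.log((p + eps) / (q + eps)) for p, q in zip(p_row, q_row)
        )
        count += 1
    return total / count if count else 0.0


# Toy embeddings standing in for the old and current model on one batch.
txt_emb = [[1.0, 0.1], [0.1, 1.0]]
old = contrastive_matrix([[1.0, 0.0], [0.0, 1.0]], txt_emb)
new = contrastive_matrix([[1.0, 0.5], [0.0, 1.0]], txt_emb)
loss = off_diagonal_alignment_loss(old, new)
```

In a training loop, a term of this kind would presumably be added to the usual contrastive loss on the new data, so that the current model keeps the old model's similarity structure while fitting the new domain. How the two terms are weighted is not stated in the abstract.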