This paper investigates the feasibility of continuously training the CLIP model on streaming data. By tracking the directional changes of the representation vectors in the continuously updated CLIP model, we characterize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we demonstrate, both empirically and theoretically, how intra-modal rotation and inter-modal deviation lead to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate the spatial disorder, we propose a simple yet effective continual learning framework, Mod-X: \textbf{M}aintain \textbf{o}ff-\textbf{d}iagonal information-matri\textbf{X}. Experiments (in Sections \ref{method} and \ref{experiments} and Appendix \ref{Appendix_to_experiments}) on commonly used datasets of different scales and scopes illustrate the effectiveness of our method.