This paper examines the feasibility of continually training the CLIP model on streaming data. By tracking the directional changes of the representation vectors in the continually updated CLIP model, we identify and characterize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we demonstrate, both empirically and theoretically, how intra-modal rotation and inter-modal deviation lead to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate the spatial disorder, we propose a simple yet effective continual learning framework, Mod-X: \textbf{M}aintain \textbf{o}ff-\textbf{d}iagonal information-matri\textbf{X}. The experiments (in Sections \ref{method} and \ref{experiments} and Appendix \ref{Appendix_to_experiments}) on commonly used datasets of different scales and scopes illustrate the effectiveness of our method.