Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks through offline training on large-scale datasets. Because real-world data keeps growing, this offline training paradigm is unsustainable on ever-expanding data: models lack the continual learning ability to accumulate knowledge over time. Meanwhile, most continual learning studies are limited to uni-modal classification, and existing multi-modal datasets cannot simulate continual, non-stationary data streams. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute P9D, a comprehensive and unified benchmark dataset containing over one million product image-text pairs from 9 industries. Treating the data from each industry as an independent task supports continual learning, and the data follows a real-world long-tail distribution, which simulates pretraining on web data. We comprehensively study the characteristics and challenges of VLCP and propose a new algorithm: Compatible momentum contrast with Topology Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge of the current and previous-task models to flexibly update the modal features. Moreover, Topology Preservation transfers embedding knowledge across tasks while preserving the flexibility of feature adjustment. Experimental results demonstrate that our method not only outperforms other baselines but also avoids an expensive training burden. The dataset and code are available at https://github.com/KevinLight831/CTP.
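To make the two components of CTP more concrete, the following is a minimal sketch and not the released implementation. It assumes that the compatible momentum model is an exponential moving average toward a blend of current and previous-task weights, and that Topology Preservation can be approximated by a KL divergence between in-batch neighbor distributions computed from old and new embeddings. The function names, the blend weight `alpha`, the momentum `m`, and the temperature `tau` are all hypothetical choices made for illustration.

```python
import math


def compatible_momentum_update(momentum, current, previous, m=0.9, alpha=0.5):
    """One EMA step toward a blend of current and previous-task weights.

    Assumed form for illustration only; weights are flat lists of floats.
    """
    target = [alpha * c + (1 - alpha) * p for c, p in zip(current, previous)]
    return [m * w + (1 - m) * t for w, t in zip(momentum, target)]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def neighbor_distribution(embeddings, tau):
    """Softmax over cosine similarities from each sample to the other samples."""
    unit = [_normalize(e) for e in embeddings]
    rows = []
    for i, a in enumerate(unit):
        logits = [sum(x * y for x, y in zip(a, b)) / tau
                  for j, b in enumerate(unit) if j != i]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def topology_loss(old_embeddings, new_embeddings, tau=0.1):
    """Mean KL(old || new) between per-sample neighbor distributions."""
    p_rows = neighbor_distribution(old_embeddings, tau)
    q_rows = neighbor_distribution(new_embeddings, tau)
    kl = 0.0
    for p_row, q_row in zip(p_rows, q_rows):
        kl += sum(p * math.log(p / q) for p, q in zip(p_row, q_row))
    return kl / len(p_rows)


if __name__ == "__main__":
    old = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # old rotated by 90 degrees
    print(topology_loss(old, rotated))
    print(compatible_momentum_update([0.0], [1.0], [0.0]))
```

Under these assumptions, a loss on neighbor distributions constrains only the relative similarities between samples, so the embedding space can move freely (for example, rotate) as long as neighborhood structure is kept. This is one way to read "preserving the flexibility of feature adjustment".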