While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their proficiency in only a few languages, such as English, restricts their applicability to broader communities. To address this, there is increasing interest in developing multilingual VL models via a joint-learning setup, which can be unrealistic given its high cost and limited data availability. In this work, we propose to extend VL-PTMs' language capacity through continual language learning (CLL), where a model must update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It trains only the token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF caused by covariate shift and lexical overlap, we further propose a novel approach that ensures an identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and consistently improve various state-of-the-art methods. Our code and data are available at \url{https://github.com/yangbang18/CLFM}.
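The expandable token embedding layer described above can be illustrated with a small sketch. This is not the authors' implementation: the class `ExpandableEmbedding` and its methods `stats` and `expand` are hypothetical names. Sampling new rows from a Gaussian matched to the mean and standard deviation of the existing table is only one plausible reading of "identical distribution of all token embeddings during initialization". The sketch also leaves out training, the cross-modal and cross-lingual objectives, and the regularization step.

```python
import random
import statistics


class ExpandableEmbedding:
    """Toy token-embedding table that can grow as new languages arrive.

    Hypothetical illustration only; not the CLL-CLIP code.
    """

    def __init__(self, vocab, dim, seed=0):
        self.rng = random.Random(seed)
        self.dim = dim
        self.index = {}  # token -> row id
        self.rows = []   # one list of floats per token
        for tok in vocab:
            self.index[tok] = len(self.rows)
            self.rows.append([self.rng.gauss(0.0, 0.02) for _ in range(dim)])

    def stats(self):
        """Mean and standard deviation over every scalar in the table."""
        flat = [v for row in self.rows for v in row]
        return statistics.fmean(flat), statistics.pstdev(flat)

    def expand(self, new_tokens):
        """Add unseen tokens, sampling them from the current table statistics.

        Assumption: matching mean/std is what keeps old and new rows
        identically distributed. Tokens already present (lexical overlap)
        keep their existing row. Returns the tokens that were actually added.
        """
        mean, std = self.stats()
        added = []
        for tok in new_tokens:
            if tok in self.index:
                continue  # shared token: reuse the existing embedding
            self.index[tok] = len(self.rows)
            self.rows.append([self.rng.gauss(mean, std) for _ in range(self.dim)])
            added.append(tok)
        return added


# Start from a tiny "English" vocabulary, then add a second language.
emb = ExpandableEmbedding(["a", "photo", "of"], dim=64)
added = emb.expand(["une", "photo", "de"])
print(added)          # ['une', 'de'] ("photo" overlaps and is reused)
print(len(emb.rows))  # 5
```

In this sketch the overlapping token `photo` keeps its original row. Shared tokens like this are where lexical overlap can cause interference between languages, which is why the abstract pairs the initialization with a regularizer on token embedding learning.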