Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery of only a few languages, such as English, restricts their applicability in broader communities. To address this, there is increasing interest in developing multilingual VL models via a joint-learning setup, which can be unrealistic because of its high cost and limited data availability. In this work, we propose to extend the language capacity of VL-PTMs through continual language learning (CLL), in which a model must update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has already acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It trains only the token embeddings to improve memory stability, and it is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF caused by covariate shift and lexical overlap, we further propose a novel approach that ensures an identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and consistently improve various state-of-the-art methods. Our code and data are available at https://github.com/yangbang18/CLFM.
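The abstract reports results as text-to-image average Recall@1. As a minimal sketch of how such a number is typically computed (not the authors' evaluation code), the Python below scores each text query against all candidate images, checks whether the matching image ranks first, and then averages the per-language values. The function names, the toy similarity scores, and the choice of an unweighted mean over languages are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                gold: Sequence[int], k: int = 1) -> float:
    """Fraction of text queries whose gold image ranks in the top k.

    similarity[i][j] is the score between text query i and image j;
    gold[i] is the index of the image that matches query i.
    """
    hits = 0
    for scores, target in zip(similarity, gold):
        # Rank image indices by descending similarity to this text query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(gold)


def average_recall(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language recall values (an assumption)."""
    return sum(per_language.values()) / len(per_language)


# Toy scores, made up for illustration: 3 text queries x 3 images per language.
toy_scores = {
    "en": [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6]],
    "zh": [[0.4, 0.5, 0.1], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6]],
}
gold = [0, 1, 2]  # query i matches image i

per_language = {lang: recall_at_k(sim, gold, k=1) for lang, sim in toy_scores.items()}
avg_r1 = average_recall(per_language)
print(per_language, avg_r1)
```

In this toy setup the first "zh" query ranks the wrong image first, so that language scores lower than "en" and pulls down the average. In a continual-learning setting, the same average would be recomputed after each new language is learned to see how much earlier languages degrade.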

