Pre-trained vision and language models such as CLIP have achieved remarkable success in connecting images and text, but with a primary focus on English. Despite recent efforts to extend CLIP to other languages, performance disparities among languages persist because of uneven resource availability. In addition, current cross-lingual transfer methods for these pre-trained models consume excessive resources when applied to a large number of languages. We therefore propose a new parameter-efficient cross-lingual transfer learning framework that uses a translation-based alignment method to mitigate multilingual disparities and explores parameter-efficient fine-tuning methods to reduce the cost of cross-lingual transfer. Extensive experiments on the XTD and Multi30K datasets, covering 11 languages under zero-shot, few-shot, and full-dataset learning scenarios, show that our framework significantly reduces disparities among languages and improves cross-lingual transfer results, especially in low-resource scenarios, while keeping and fine-tuning only an extremely small number of parameters compared to the full model (e.g., in the few-shot learning scenario, our framework requires additional parameters amounting to only 0.16\% of the full model for each language). The code is available at \url{https://github.com/eric-ai-lab/PECTVLM}.