Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. Pre-training mostly relies on lexical databases and image queries in English. Previous work has demonstrated that English pre-training does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLMs) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using an MPLM. We use cross-lingual alignment of contextualized token embeddings to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target-language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.
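To make the token-level alignment idea in the abstract concrete, the following is a minimal sketch rather than the released implementation. It assumes that an English sentence and its machine translation have already been encoded into per-token vectors, and that an external word aligner has produced index pairs linking English tokens to target-language tokens. The function name `token_alignment_loss`, the mean-squared-error objective, and the toy vectors are illustrative assumptions; the abstract does not specify the loss, the aligner, or the encoders.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def token_alignment_loss(
    teacher_tokens: Sequence[Vector],
    student_tokens: Sequence[Vector],
    alignment: Sequence[Tuple[int, int]],
) -> float:
    """Mean squared error over aligned token pairs (assumed objective).

    teacher_tokens: contextualized embeddings of the English tokens,
        e.g. from the frozen text encoder of the VLP model.
    student_tokens: contextualized embeddings of the translated tokens,
        e.g. from the multilingual encoder being trained.
    alignment: (english_index, target_index) pairs from a word aligner.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one token pair")

    total = 0.0
    for en_idx, tgt_idx in alignment:
        teacher_vec = teacher_tokens[en_idx]
        student_vec = student_tokens[tgt_idx]
        if len(teacher_vec) != len(student_vec):
            raise ValueError("embedding dimensions must match")
        # Squared distance between the two aligned token vectors,
        # averaged over the embedding dimension.
        squared = sum((t - s) ** 2 for t, s in zip(teacher_vec, student_vec))
        total += squared / len(teacher_vec)

    # Average over all aligned pairs in the sentence.
    return total / len(alignment)


# Toy example: two tokens whose order is swapped by translation.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.0, 1.0], [1.0, 0.0]]

matched = token_alignment_loss(teacher, student, [(0, 1), (1, 0)])
mismatched = token_alignment_loss(teacher, student, [(0, 0), (1, 1)])
```

In this toy case the correct alignment pairs identical vectors and gives a loss of zero, while the wrong alignment gives a positive loss. Only text is involved in the computation, which matches the claim that the approach requires no image input; in practice the student embeddings would come from a trainable model and the loss would be minimized by gradient descent.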