Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), a computationally far more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to `understand' the output of an image encoder. Given the abundance of readily available high-quality English image-text data and monolingual English LLMs, research has focused on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this re-alignment, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at \url{https://github.com/gregor-ge/mBLIP}.