Modular vision-language models (Vision-LLMs) align pretrained image encoders with frozen large language models (LLMs), a far more computationally efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most researchers and practitioners. Vision-LLMs instead condition LLMs post hoc to `understand' the output of an image encoder. Given the abundance of readily available high-quality English image-text data and of monolingual English LLMs, research has focused on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware and using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at \url{https://github.com/gregor-ge/mBLIP}.