mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to 'understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner (on consumer hardware, using only a few million training examples) by leveraging a pretrained multilingual LLM. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at https://github.com/gregor-ge/mBLIP.
