mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Modular vision-language models (Vision-LLMs) align pretrained image encoders with frozen large language models (LLMs). This is far more computationally efficient than end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most researchers and practitioners. Vision-LLMs instead condition LLMs post hoc to 'understand' the output of an image encoder. Because high-quality English image-text data and monolingual English LLMs are abundant and readily available, research has focused on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner (on consumer hardware and using only a few million training examples) by leveraging a pretrained multilingual LLM. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at https://github.com/gregor-ge/mBLIP.
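The modular idea in the abstract (a small alignment module feeding image-encoder outputs to an LLM) can be illustrated with a toy sketch. This is not the mBLIP implementation: the single linear projection, the dimensions `IMG_DIM` and `LLM_DIM`, and the function names are all hypothetical, chosen only to show how projected visual tokens might be prepended to text embeddings while the encoder and LLM themselves stay untouched.

```python
import random

rng = random.Random(0)

IMG_DIM = 8  # hypothetical image-encoder feature size
LLM_DIM = 4  # hypothetical LLM embedding size


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the alignment module (an assumption: one linear map).
# In this sketch it is the only component that would be trained.
projection = random_matrix(LLM_DIM, IMG_DIM)


def project(image_features):
    """Map each image feature vector into the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, feat)) for row in projection]
        for feat in image_features
    ]


def build_llm_input(image_features, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings."""
    return project(image_features) + text_embeddings


image_features = random_matrix(3, IMG_DIM)   # 3 visual tokens (encoder output)
text_embeddings = random_matrix(5, LLM_DIM)  # 5 prompt tokens (LLM embeddings)
llm_input = build_llm_input(image_features, text_embeddings)
print(len(llm_input), len(llm_input[0]))  # 8 4
```

Under these assumptions, re-aligning to a different LLM would mean retraining only `projection` against the new model's embedding space, which is why the trainable parameter count stays small relative to the full model.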

