HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to compositional ones but currently lag behind them in performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown to be capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs therefore lies in the lack of a holistic embedding module for both vision and language inputs. HoVLE introduces such a module, which converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as text. Furthermore, a multi-stage training strategy is designed to empower the holistic embedding module. The module is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, which enables large-scale training with unpaired random images and text tokens. The whole model then undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. The model is available at https://huggingface.co/OpenGVLab/HoVLE.
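The first training stage described in the abstract can be illustrated with a toy distillation objective. This is a minimal sketch, not the authors' implementation: the abstract does not state the loss, so the cosine-based distance, the equal weighting of the two terms, and the names `cosine`, `distill_loss`, and `stage1_loss` are all assumptions made for illustration. It only shows why unpaired data suffices: the image term and the text term are computed independently, each against its own teacher (vision-encoder features for images, LLM text embeddings for text).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def distill_loss(student, teacher):
    """Mean (1 - cosine) over positions.

    Assumed distance for this sketch; it is 0 when every student vector
    points in the same direction as its teacher target.
    """
    assert len(student) == len(teacher) and student, "need matching, non-empty inputs"
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)

def stage1_loss(img_student, img_teacher, txt_student, txt_teacher):
    """Sum of two independent terms, so images and text need not be paired.

    img_teacher: features from a pre-trained vision encoder (hypothetical input)
    txt_teacher: text embeddings from the LLM (hypothetical input)
    """
    return distill_loss(img_student, img_teacher) + distill_loss(txt_student, txt_teacher)

# Toy example: image outputs already match their targets in direction,
# while the single text output is orthogonal to its target.
img_out = [[1.0, 0.0], [0.0, 2.0]]
img_tgt = [[2.0, 0.0], [0.0, 1.0]]
txt_out = [[1.0, 0.0]]
txt_tgt = [[0.0, 1.0]]
loss = stage1_loss(img_out, img_tgt, txt_out, txt_tgt)
```

In this toy case the image term contributes nothing and the text term contributes the whole loss, which is the signal that would drive the shared embedding module toward the LLM's text embedding space.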

