The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to compositional VLMs but currently lag behind them in performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown to be capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs therefore lies in the lack of a holistic embedding module for both vision and language inputs. Accordingly, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as text. Furthermore, a multi-stage training strategy is designed to empower the holistic embedding module. The module is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, which enables large-scale training with unpaired random images and text tokens. The whole model then undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. The model is available at https://huggingface.co/OpenGVLab/HoVLE.
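The first training stage described above, in which the holistic embedding module is distilled from two teachers using unpaired images and text tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-distance objective, the function names `cosine_distance` and `distillation_loss`, and the toy vectors are all assumptions made for illustration, and plain Python lists stand in for real model outputs.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def distillation_loss(student_img, teacher_img, student_txt, teacher_txt):
    """Hypothetical distillation objective (an assumption, not from the abstract).

    student_img / student_txt: outputs of the holistic embedding module for
        image patches and text tokens.
    teacher_img: features from a pre-trained vision encoder for the same patches.
    teacher_txt: embeddings from the LLM's embedding table for the same tokens.

    The image and text batches never need to be paired with each other,
    because each modality is matched only against its own teacher.
    """
    img_term = sum(
        cosine_distance(s, t) for s, t in zip(student_img, teacher_img)
    ) / len(student_img)
    txt_term = sum(
        cosine_distance(s, t) for s, t in zip(student_txt, teacher_txt)
    ) / len(student_txt)
    return img_term + txt_term


def random_vectors(count, dim, rng):
    """Toy stand-in for model outputs: `count` random vectors of size `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Toy example: 4 image patches and 6 text tokens, sampled independently.
rng = random.Random(0)
teacher_img = random_vectors(4, 8, rng)
teacher_txt = random_vectors(6, 8, rng)
student_img = random_vectors(4, 8, rng)
student_txt = random_vectors(6, 8, rng)

loss = distillation_loss(student_img, teacher_img, student_txt, teacher_txt)
print(f"toy distillation loss: {loss:.4f}")
```

Because each term compares the student only with the teacher for its own modality, the image batch and the text batch can come from unrelated sources, which is what makes training on unpaired data possible in this stage.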