Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with a pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between the two embedding strategies in MLLMs (structural textual embeddings drawn from an embedding look-up table versus continuous embeddings generated directly by the vision encoder) poses challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, so that its final visual embedding is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks demonstrate that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Both the source code and the training dataset of Ovis will be made publicly available.
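The contrast between the two embedding strategies can be illustrated with a minimal sketch. It is not the Ovis implementation: the table sizes, the function names (`text_embedding`, `visual_embedding`), and the use of a softmax over per-patch scores to obtain the probabilistic weights are all assumptions made for illustration. A text token selects exactly one row of its table, whereas an image patch is assumed to produce a score for every row of the visual embedding table and to take the probability-weighted sum of those rows.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def text_embedding(table, token_id):
    """Textual case: a token id indexes the look-up table exactly once."""
    return table[token_id]


def visual_embedding(table, patch_logits):
    """Visual case (assumed form): weight every table row by the patch's
    probability for that row, then sum the weighted rows."""
    probs = softmax(patch_logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Hypothetical sizes, chosen only to keep the example small.
random.seed(0)
VOCAB, DIM = 8, 4
visual_table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

# Stand-in for the per-patch scores that a vision encoder would produce.
patch_logits = [random.gauss(0, 1) for _ in range(VOCAB)]
emb = visual_embedding(visual_table, patch_logits)

# A patch whose probability mass sits on one row reduces to a plain look-up,
# which is the sense in which the two strategies are structurally aligned.
peaked_logits = [-1e9] * VOCAB
peaked_logits[3] = 0.0
peaked_emb = visual_embedding(visual_table, peaked_logits)
```

In this sketch the textual look-up is the special case of the visual one in which the distribution is one-hot, so both modalities draw their embeddings from a table in the same way.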