Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research focuses primarily on instruction tuning over language and image data, neglecting the critical pretraining phase in which models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with their most relevant prompt tokens. We then conceptualize visual tokens as a "foreign language" for the LLM and propose a mixed attention mechanism, combining bidirectional attention over visual tokens with unidirectional attention over text tokens, to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further help the LLM understand visual semantic information. After pretraining on 1.5 million publicly accessible samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance across a broad suite of vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pretrained model weights at https://github.com/deepglint/Croc.
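The Hungarian-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `replace_with_prompt_tokens`, the use of cosine similarity as the matching cost, and the choice to replace only the top-scoring matched pairs are all assumptions for the sake of a concrete example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm


def replace_with_prompt_tokens(visual_tokens, prompt_pool, num_replace):
    """Swap a subset of visual tokens for their best-matching prompt tokens.

    visual_tokens: (N, D) array of visual token embeddings.
    prompt_pool:   (P, D) array of learnable prompt embeddings.
    num_replace:   number of visual tokens to replace (<= min(N, P)).

    Returns the modified token array and the indices that were replaced.
    (Hypothetical sketch: cost function and selection rule are assumptions.)
    """
    # Cosine similarity between every visual token and every prompt token.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_pool / np.linalg.norm(prompt_pool, axis=1, keepdims=True)
    sim = v @ p.T  # (N, P)

    # Hungarian algorithm finds the one-to-one assignment maximizing
    # total similarity (scipy minimizes, so negate the similarities).
    rows, cols = linear_sum_assignment(-sim)

    # Keep only the num_replace highest-similarity matched pairs.
    order = np.argsort(-sim[rows, cols])[:num_replace]
    out = visual_tokens.copy()
    out[rows[order]] = prompt_pool[cols[order]]
    return out, rows[order]
```

In training, `prompt_pool` would be a learnable parameter updated by backpropagation; the sketch above only shows the assignment and replacement logic.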
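The mixed attention mechanism, bidirectional over visual tokens and unidirectional (causal) over text tokens, can be expressed as a single boolean attention mask. The sketch below assumes a prefix-style sequence layout `[visual tokens | text tokens]` in which visual tokens do not attend to text; the function name and layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def mixed_attention_mask(num_visual, num_text):
    """Boolean attention mask (True = position may attend).

    Combines bidirectional attention within the visual-token block with
    causal (unidirectional) attention over the text tokens, assuming the
    sequence layout [visual | text]. (Illustrative sketch.)
    """
    n = num_visual + num_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[:num_visual, :num_visual] = True        # visual block: fully bidirectional
    return mask
```

Such a mask would typically be converted to additive form (0 where `True`, a large negative value where `False`) before being added to the attention logits.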