Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research focuses primarily on instruction tuning over language and image data, neglecting the critical pretraining phase in which models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with their most relevant prompt tokens. We then conceptualize visual tokens as a "foreign language" for the LLM and propose a mixed attention mechanism, combining bidirectional attention over visual tokens with unidirectional attention over text tokens, to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further help the LLM understand visual semantic information. After pretraining on 1.5 million publicly accessible samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance across a broad suite of vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pretrained model weights at https://github.com/deepglint/Croc.
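The Hungarian-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `replace_with_prompt_tokens`, the use of cosine similarity as the matching cost, and the choice to replace only the top-scoring matched pairs are all assumptions for the sake of a concrete example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm


def replace_with_prompt_tokens(visual_tokens, prompt_pool, num_replace):
    """Swap a subset of visual tokens for their best-matching prompt tokens.

    visual_tokens: (N, D) array of visual token embeddings.
    prompt_pool:   (P, D) array of learnable prompt embeddings.
    num_replace:   number of visual tokens to replace (<= min(N, P)).

    Returns the modified token array and the indices that were replaced.
    (Hypothetical sketch: cost function and selection rule are assumptions.)
    """
    # Cosine similarity between every visual token and every prompt token.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_pool / np.linalg.norm(prompt_pool, axis=1, keepdims=True)
    sim = v @ p.T  # (N, P)

    # Hungarian algorithm finds the one-to-one assignment maximizing
    # total similarity (scipy minimizes, so negate the similarities).
    rows, cols = linear_sum_assignment(-sim)

    # Keep only the num_replace highest-similarity matched pairs.
    order = np.argsort(-sim[rows, cols])[:num_replace]
    out = visual_tokens.copy()
    out[rows[order]] = prompt_pool[cols[order]]
    return out, rows[order]
```

In training, `prompt_pool` would be a learnable parameter updated by backpropagation; the sketch above only shows the assignment and replacement logic.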
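The mixed attention mechanism, bidirectional over visual tokens and unidirectional (causal) over text tokens, can be expressed as a single boolean attention mask. The sketch below assumes a prefix-style sequence layout `[visual tokens | text tokens]` in which visual tokens do not attend to text; the function name and layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def mixed_attention_mask(num_visual, num_text):
    """Boolean attention mask (True = position may attend).

    Combines bidirectional attention within the visual-token block with
    causal (unidirectional) attention over the text tokens, assuming the
    sequence layout [visual | text]. (Illustrative sketch.)
    """
    n = num_visual + num_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[:num_visual, :num_visual] = True        # visual block: fully bidirectional
    return mask
```

Such a mask would typically be converted to additive form (0 where `True`, a large negative value where `False`) before being added to the attention logits.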