In this work, we introduce Libra, a prototype model that adds a decoupled vision system to a large language model (LLM). The vision system separates inner-modal modeling from cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM, routing the vision and language flows during attention computation so that inner-modal modeling and cross-modal interaction can use different attention patterns. Experimental results demonstrate that this dedicated design yields a strong multimodal LLM (MLLM) baseline that rivals existing works in the image-to-text scenario with only 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at https://github.com/YifanXu74/Libra.
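To make the routing idea concrete, the following is a minimal sketch, not the Libra implementation. It assumes that "routing" means selecting a separate set of query/key/value projection weights per token according to its modality, and then running ordinary single-head causal attention over the mixed sequence. The cross-modal bridge module and the discrete visual tokenizer are not modeled, and every name here (`routed_project`, `routed_causal_attention`, `text_params`, `vision_params`) is hypothetical.

```python
import math


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def routed_project(tokens, is_vision, w_text, w_vision):
    """Project each token with the weights selected by its modality flag."""
    return [matvec(w_vision if flag else w_text, x)
            for x, flag in zip(tokens, is_vision)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def routed_causal_attention(tokens, is_vision, text_params, vision_params):
    """Single-head causal attention with modality-routed Q/K/V projections.

    text_params and vision_params are dicts with keys "q", "k", "v"
    (an assumed layout for this sketch, not taken from the paper).
    """
    q = routed_project(tokens, is_vision, text_params["q"], vision_params["q"])
    k = routed_project(tokens, is_vision, text_params["k"], vision_params["k"])
    v = routed_project(tokens, is_vision, text_params["v"], vision_params["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for i, qi in enumerate(q):
        # Causal mask: position i attends only to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(i + 1)]
        weights = softmax(scores)
        outputs.append([sum(w * v[j][d] for j, w in enumerate(weights))
                        for d in range(len(v[0]))])
    return outputs


# Toy usage: two vision tokens followed by one text token.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_params = {"q": identity, "k": identity, "v": identity}
vision_params = {"q": doubled, "k": doubled, "v": doubled}
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
is_vision = [True, True, False]
out = routed_causal_attention(tokens, is_vision, text_params, vision_params)
```

In this toy setup the vision tokens pass through the "doubled" weights while the text token uses the identity weights, so the two modalities are modeled by different parameters even though they share one attention computation. How the actual model combines the routed expert with the bridge module is described only at a high level in the abstract and is not reproduced here.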