InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

From arXiv. Code and models are available at https://github.com/InternLM/InternLM-XComposer

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 introduces a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens, preserving the integrity of pre-trained language knowledge and striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multi-modal content, as well as its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
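
The PLoRA idea described above can be illustrated with a minimal sketch. The abstract states only that additional LoRA parameters are applied exclusively to image tokens; everything else here is an assumption made for illustration: the function names (`matvec`, `partial_lora_layer`), the toy shapes, the rank, and the use of the standard LoRA form (a frozen weight `W0` plus a low-rank update `B A`). It is not taken from the InternLM-XComposer2 code base.

```python
# Illustrative sketch only: names, shapes, and the rank are assumptions,
# not the authors' implementation.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def partial_lora_layer(tokens, is_image, W0, A, B):
    """Apply the frozen linear map W0 to every token, and add the
    low-rank update B @ A only to tokens flagged as image tokens."""
    outputs = []
    for x, image_flag in zip(tokens, is_image):
        y = matvec(W0, x)                      # shared pre-trained path
        if image_flag:
            delta = matvec(B, matvec(A, x))    # low-rank path, image tokens only
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy dimensions (assumed): input size 3, output size 2, rank 1.
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
A = [[0.0, 0.0, 1.0]]        # rank x d_in
B = [[0.5], [-0.5]]          # d_out x rank

tokens = [[1.0, 2.0, 4.0],   # treated as an image token
          [1.0, 2.0, 4.0]]   # text token with identical features
is_image = [True, False]

out = partial_lora_layer(tokens, is_image, W0, A, B)
print(out)  # image token gets the low-rank update, text token does not
```

In this sketch the two tokens carry identical features, so any difference in the output comes only from the image-token mask: the text token passes through `W0` alone, which is how the pre-trained language path stays untouched.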

