We introduce InternLM-XComposer2, a vision-language model that excels at free-form text-image composition and comprehension. Beyond conventional vision-language understanding, the model crafts interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. We propose a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens, preserving the integrity of pre-trained language knowledge and balancing precise vision understanding against literary-quality text composition. Experiments show that InternLM-XComposer2, built on InternLM2-7B, produces high-quality long-text multi-modal content and achieves strong vision-language understanding across various benchmarks: it significantly outperforms existing multimodal models and matches or even surpasses GPT-4V and Gemini Pro in certain assessments. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
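The core idea of PLoRA, as described above, is that a low-rank update is added only for image tokens while text tokens pass through the frozen pre-trained weights unchanged. The following is a minimal sketch of that idea for a single linear projection, not the released implementation. The function names (`matvec`, `plora_forward`), the matrix shapes, the boolean `is_image` mask, and the toy integer weights are all assumptions chosen for illustration. In practice the low-rank matrices would be trained and the computation would run on batched tensors.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def plora_forward(tokens, is_image, W0, A, B):
    """Project every token with the shared weight W0 and add the
    low-rank term B @ (A @ x) only for tokens flagged as image tokens.

    Hypothetical shapes: W0 is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(W0, x)  # path shared by text and image tokens
        if img:
            delta = matvec(B, matvec(A, x))  # low-rank path, image tokens only
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example with d_in = 4, d_out = 3, rank r = 2 (illustrative values only).
W0 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
A = [[1, 1, 0, 0], [0, 0, 1, 1]]
B = [[1, 0], [0, 1], [1, 1]]

tokens = [[1, 2, 3, 4], [1, 2, 3, 4]]  # identical inputs, different modality
is_image = [False, True]

out = plora_forward(tokens, is_image, W0, A, B)
print(out[0])  # text token:  [1, 2, 3]
print(out[1])  # image token: [4, 9, 13]
```

Because the two tokens carry identical values, the difference between the two outputs isolates the contribution of the low-rank path: the text token sees only `W0`, which is how this scheme leaves the language behavior of the base model untouched.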