In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that aims to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explore a continuous learning strategy for the large-scale vision foundation model InternViT-6B, which boosts its visual understanding capabilities and allows it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide each input image into 1 to 40 tiles of 448$\times$448 pixels according to its aspect ratio and resolution, which supports inputs of up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images and annotated it with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
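To make the dynamic high-resolution idea in (2) concrete, the following is a minimal sketch of one way a tile grid could be chosen from an image's aspect ratio and resolution. It is an illustration under stated assumptions, not the released implementation: the names `choose_grid` and `tile_boxes`, the tie-breaking rule (prefer the grid whose pixel area is closest to the image's), and the assumption that the image is resized to the grid size before cropping are all hypothetical. Only the 448-pixel tile size and the 1 to 40 tile range come from the text above.

```python
TILE = 448        # tile side length in pixels (from the text)
MAX_TILES = 40    # upper bound on the number of tiles (from the text)


def choose_grid(width, height, min_tiles=1, max_tiles=MAX_TILES):
    """Return (cols, rows) for the tile grid that best matches the image.

    Assumption: "best" means the smallest aspect-ratio difference, with
    ties broken by how close the grid's pixel area is to the image's area.
    """
    target = width / height
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        # Only consider grids with cols * rows <= max_tiles.
        for rows in range(1, max_tiles // cols + 1):
            if cols * rows < min_tiles:
                continue
            ratio_diff = abs(cols / rows - target)
            area_diff = abs(cols * rows * TILE * TILE - width * height)
            key = (ratio_diff, area_diff)
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def tile_boxes(cols, rows, tile=TILE):
    """Return (left, top, right, bottom) crop boxes in row-major order.

    Assumption: the image has already been resized to
    (cols * tile) x (rows * tile) pixels.
    """
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)  # a 4K (16:9) input
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under these assumptions, a square 448$\times$448 image maps to a single tile, while a 4K image maps to a larger grid whose aspect ratio approximates 16:9 within the 40-tile budget.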