The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. The model can be broadly applied to, and achieve state-of-the-art performance on, visual perception tasks such as image-level and pixel-level recognition, and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval. It can also be linked with LLMs to create multi-modal dialogue systems. We hope that our research will contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.