In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a carefully designed (i) visual receptor, (ii) input-output interface, (iii) three-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we give the Qwen-VL models grounding and text-reading abilities by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records among generalist models of similar scale on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also outperforms existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.