Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models of similar scale on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also outperforms existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.

