We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks such as image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks, including zero-shot captioning, general and document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.