We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. The series comprises Qwen-VL and Qwen-VL-Chat, which perform strongly on tasks such as image captioning, question answering, visual localization, and flexible interaction. Our evaluation covers a wide range of tasks, including zero-shot captioning, visual question answering, document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing LVLMs. We present the models' architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.