We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model attains results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
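
The Naive Dynamic Resolution idea is easiest to see as arithmetic: the number of visual tokens is a function of the input resolution rather than a fixed constant. Below is a minimal Python sketch assuming a ViT patch size of 14 with 2x2 adjacent-token merging (the values reported for Qwen2-VL); the helper name and the rounding policy are illustrative, not the released preprocessing code.

    # Hypothetical helper: number of visual tokens produced for a given image size,
    # assuming a ViT patch size of 14 and 2x2 adjacent-token merging, so one merged
    # token covers a 28x28 pixel region. The rounding policy is illustrative only.
    def visual_token_count(height: int, width: int,
                           patch_size: int = 14, merge_size: int = 2) -> int:
        unit = patch_size * merge_size              # 28 pixels per merged token
        h = max(unit, round(height / unit) * unit)  # snap each side to the grid
        w = max(unit, round(width / unit) * unit)
        return (h // unit) * (w // unit)

    print(visual_token_count(224, 224))    # 8 x 8  = 64 merged tokens
    print(visual_token_count(448, 1344))   # 16 x 48 = 768 merged tokens

Under these assumptions, a square 224x224 image maps to 64 merged tokens (before any delimiter tokens the model adds around visual spans), while a wide 448x1344 input maps to 768, so token cost tracks image size instead of being fixed in advance.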
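M-RoPE's decomposition can likewise be sketched: each token carries three position indices (temporal, height, width). Text tokens repeat one index across all three components, so M-RoPE reduces to ordinary 1D RoPE on pure text, while visual tokens index into their spatio-temporal grid. The function below is a hypothetical illustration of how those indices could be laid out for a text prefix followed by one image or video clip; the offset policy is simplified relative to any released implementation.

    import torch

    # Hypothetical layout of M-RoPE position ids for [text prefix][one visual grid].
    # Each token gets three indices (temporal, height, width). Text tokens repeat
    # one index across all three components, matching plain 1D RoPE on text.
    def mrope_position_ids(text_len: int, grid_t: int, grid_h: int,
                           grid_w: int) -> torch.Tensor:
        pos = torch.arange(text_len)
        text_ids = torch.stack([pos, pos, pos])              # (3, text_len)
        t = torch.arange(grid_t).view(-1, 1, 1).expand(grid_t, grid_h, grid_w)
        h = torch.arange(grid_h).view(1, -1, 1).expand(grid_t, grid_h, grid_w)
        w = torch.arange(grid_w).view(1, 1, -1).expand(grid_t, grid_h, grid_w)
        # Simplified offset: visual positions continue after the text prefix.
        vis_ids = torch.stack([t, h, w]).reshape(3, -1) + text_len
        return torch.cat([text_ids, vis_ids], dim=1)         # (3, total_len)

    ids = mrope_position_ids(text_len=5, grid_t=1, grid_h=4, grid_w=6)
    print(ids.shape)   # torch.Size([3, 29])

In the attention layers, the rotary frequencies along the head dimension would then be partitioned among the three components; that split is omitted here for brevity.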