We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model attains results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
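
The Naive Dynamic Resolution idea is easiest to see as arithmetic: the number of visual tokens is a function of the input resolution rather than a fixed constant. Below is a minimal Python sketch assuming a ViT patch size of 14 with 2x2 adjacent-token merging (the values reported for Qwen2-VL); the helper name and the rounding policy are illustrative, not the released preprocessing code.

    # Hypothetical helper: number of visual tokens produced for a given image size,
    # assuming a ViT patch size of 14 and 2x2 adjacent-token merging, so one merged
    # token covers a 28x28 pixel region. The rounding policy is illustrative only.
    def visual_token_count(height: int, width: int,
                           patch_size: int = 14, merge_size: int = 2) -> int:
        unit = patch_size * merge_size              # 28 pixels per merged token
        h = max(unit, round(height / unit) * unit)  # snap each side to the grid
        w = max(unit, round(width / unit) * unit)
        return (h // unit) * (w // unit)

    print(visual_token_count(224, 224))    # 8 x 8  = 64 merged tokens
    print(visual_token_count(448, 1344))   # 16 x 48 = 768 merged tokens

Under these assumptions, a square 224x224 image maps to 64 merged tokens (before any delimiter tokens the model adds around visual spans), while a wide 448x1344 input maps to 768, so token cost tracks image size instead of being fixed in advance.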
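M-RoPE's decomposition can likewise be sketched: each token carries three position indices (temporal, height, width). Text tokens repeat one index across all three components, so M-RoPE reduces to ordinary 1D RoPE on pure text, while visual tokens index into their spatio-temporal grid. The function below is a hypothetical illustration of how those indices could be laid out for a text prefix followed by one image or video clip; the offset policy is simplified relative to any released implementation.

    import torch

    # Hypothetical layout of M-RoPE position ids for [text prefix][one visual grid].
    # Each token gets three indices (temporal, height, width). Text tokens repeat
    # one index across all three components, matching plain 1D RoPE on text.
    def mrope_position_ids(text_len: int, grid_t: int, grid_h: int,
                           grid_w: int) -> torch.Tensor:
        pos = torch.arange(text_len)
        text_ids = torch.stack([pos, pos, pos])              # (3, text_len)
        t = torch.arange(grid_t).view(-1, 1, 1).expand(grid_t, grid_h, grid_w)
        h = torch.arange(grid_h).view(1, -1, 1).expand(grid_t, grid_h, grid_w)
        w = torch.arange(grid_w).view(1, 1, -1).expand(grid_t, grid_h, grid_w)
        # Simplified offset: visual positions continue after the text prefix.
        vis_ids = torch.stack([t, h, w]).reshape(3, -1) + text_len
        return torch.cat([text_ids, vis_ids], dim=1)         # (3, total_len)

    ids = mrope_position_ids(text_len=5, grid_t=1, grid_h=4, grid_w=6)
    print(ids.shape)   # torch.Size([3, 29])

In the attention layers, the rotary frequencies along the head dimension would then be partitioned among the three components; that split is omitted here for brevity.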