We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with varying aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value (KV) cache into latent vectors to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series comprises three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
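To make the dynamic tiling idea concrete, the following is a minimal sketch of how a high-resolution image with an arbitrary aspect ratio could be split into fixed-size local tiles plus a global thumbnail. The tile size (384), the cap on local tiles, and the helper names `choose_grid` and `tile_image` are illustrative assumptions, not the actual DeepSeek-VL2 API; the real implementation is in the linked repository.

```python
# A hedged sketch of a dynamic tiling strategy: pick the tile grid whose
# aspect ratio best matches the input image, resize, and cut into tiles.
from PIL import Image

TILE = 384       # assumed base tile resolution
MAX_TILES = 9    # assumed cap on local tiles per image

def choose_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_err = (1, 1), float("inf")
    for rows in range(1, MAX_TILES + 1):
        for cols in range(1, MAX_TILES // rows + 1):
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Resize to the chosen grid, cut into fixed-size local tiles, and
    append one global thumbnail so the model keeps a whole-image view."""
    cols, rows = choose_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((TILE, TILE))  # global view of the full image
    return tiles + [thumbnail]
```

Because each tile has the encoder's native resolution, a single fixed-resolution vision encoder can handle inputs of any size or aspect ratio without severe distortion.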
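The KV-cache compression behind Multi-head Latent Attention can likewise be sketched in a few lines: instead of caching full per-head keys and values for every token, the model caches one small latent vector per token and up-projects it back into K and V at attention time. The dimensions and module names below are illustrative assumptions, not the DeepSeekMoE configuration.

```python
# A minimal sketch of latent KV-cache compression, assuming toy sizes.
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    def __init__(self, d_model: int = 1024, d_latent: int = 128, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> cached latents: (batch, seq, d_latent)
        return self.down(h)

    def expand(self, cache: torch.Tensor):
        # Up-project cached latents into per-head K and V on demand.
        b, s, _ = cache.shape
        k = self.up_k(cache).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(cache).view(b, s, self.n_heads, self.d_head)
        return k, v

# Under these assumed sizes, the cache stores d_latent = 128 floats per
# token instead of 2 * d_model = 2048, a 16x memory reduction.
cache = LatentKVCache()(torch.randn(1, 16, 1024))
```

Shrinking the per-token cache footprint is what allows longer sequences and larger decoding batches at inference time, which is the efficiency gain the abstract attributes to this mechanism.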