We present DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: data construction, model architecture, and training strategy. (1) Data construction: We strive to ensure that our data is diverse and scalable and that it extensively covers real-world scenarios, including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. Fine-tuning on this dataset substantially improves the model's user experience in practical applications. (2) Model architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 × 1024) while maintaining relatively low computational overhead. This design choice allows the model to capture both critical semantic information and fine detail across various visual tasks. (3) Training strategy: We posit that a proficient vision-language model should, first and foremost, possess strong language abilities. To preserve LLM capabilities during pretraining, we investigate an effective VL pretraining strategy that integrates LLM training from the beginning and carefully manages the competitive dynamics observed between the vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) delivers a superior user experience as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of vision-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both the 1.3B and 7B models publicly accessible to foster innovation based on this foundation model.