DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan

Source: arXiv. Code: https://github.com/deepseek-ai/DeepSeek-VL

We present DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:

Data Construction: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios, including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. Fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To preserve LLM capabilities during pretraining, we investigate an effective VL pretraining strategy that integrates LLM training from the beginning and carefully manages the competitive dynamics observed between the vision and language modalities.

The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both the 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
