This paper presents DreamLLM, a learning framework that is the first to achieve versatile Multimodal Large Language Models (MLLMs) empowered by the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first is generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors such as CLIP, yielding a more thorough multimodal understanding. The second is the generation of raw, interleaved documents, modeling both text and image content along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy.