We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.