MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

From arXiv; Proceedings of Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems (NeurIPS)

The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

