The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows users to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, fusing all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.