The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.