The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows users to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from the individual modules to the downstream model. Specifically, fusing all independent components allows the image generation module to use multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.