Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapidly changing landscape, recent approaches focus on extensions and performance rather than capacity, and thus require separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable cross-modal generality beyond images and text. Through extensive experiments, we demonstrate the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics and dual- and multi-context blending; c) the success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research. Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.