Recent large-scale generative models trained on massive datasets can synthesize impressive images, yet they offer limited controllability. This work presents a new generation paradigm that allows flexible control over the output image, such as its spatial layout and color palette, while maintaining synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors and then train a diffusion model conditioned on all of these factors to recompose the input. At inference, the rich intermediate representations serve as composable elements, yielding a huge design space for customizable content creation (i.e., one that grows exponentially with the number of decomposed factors). Notably, our approach, which we call Composer, supports conditions at various levels, such as a text description as global information, a depth map and sketch as local guidance, and a color histogram for low-level details. Beyond improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available.
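One way to read the design-space claim: if each of n decomposed factors can be either supplied or omitted at inference, there are 2^n possible sets of conditions. The sketch below only illustrates that count. It assumes factors can be toggled independently, uses the four conditions named in the abstract as illustrative labels (the actual factor set used by Composer may differ), and the helper `condition_subsets` is hypothetical, not part of any released code.

```python
from itertools import combinations

# Illustrative labels taken from the abstract; the real factor set may differ.
FACTORS = ["text", "depth_map", "sketch", "color_histogram"]


def condition_subsets(factors):
    """Return every subset of `factors`, from the empty set to the full set."""
    subsets = []
    for size in range(len(factors) + 1):
        # combinations() yields each unordered selection of `size` factors once.
        subsets.extend(combinations(factors, size))
    return subsets


subsets = condition_subsets(FACTORS)
# The number of subsets doubles with every additional factor.
print(len(subsets), 2 ** len(FACTORS))
```

Under this reading, each added factor doubles the number of selectable condition sets, before even counting the different values each condition can take.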