We present multimodal conditioning modules (MCM) for conditional image synthesis with pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but requires no updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM gives users control over the spatial layout of the image and, more broadly, greater control over the image generation process. Training MCM is cheap: it requires no gradients from the original diffusion network, has only $\sim$1$\%$ of the parameters of the base diffusion model, and uses only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate improved control over the generated images and better alignment with the conditioning inputs.
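The idea of modulating a frozen network's prediction can be illustrated with a toy sketch. The abstract does not specify how the modulation is computed, so the per-pixel scale-and-shift form below is an assumption, and the names `frozen_base_eps` and `TinyMCM` are hypothetical stand-ins, not the actual architecture. The sketch only shows the two properties stated above: the base predictor is never updated, and its output enters training as a constant, so no gradients flow through it.

```python
# Toy illustration only: a frozen "base" noise predictor plus a tiny trainable
# modulator. The scale-and-shift form is an assumption, not the paper's design.

def frozen_base_eps(x_t):
    """Hypothetical stand-in for a pretrained diffusion network (never updated)."""
    return [0.9 * v for v in x_t]


class TinyMCM:
    """Hypothetical modulator with two parameters, driven by a 2D condition map
    (flattened here to a list with one value per pixel)."""

    def __init__(self):
        self.w_scale = 0.0  # zero init: the modulated prediction equals the base one
        self.w_shift = 0.0

    def modulate(self, eps, cond):
        # Per-pixel scale and shift of the frozen prediction.
        return [(1.0 + self.w_scale * c) * e + self.w_shift * c
                for e, c in zip(eps, cond)]

    def train_step(self, eps, cond, target, lr=0.1):
        # eps is a constant input here, so only the two MCM parameters
        # receive gradients; the base network is not differentiated.
        pred = self.modulate(eps, cond)
        n = len(eps)
        g_scale = sum(2 * (p - y) * c * e
                      for p, y, c, e in zip(pred, target, cond, eps)) / n
        g_shift = sum(2 * (p - y) * c
                      for p, y, c in zip(pred, target, cond)) / n
        loss = sum((p - y) ** 2 for p, y in zip(pred, target)) / n
        self.w_scale -= lr * g_scale
        self.w_shift -= lr * g_shift
        return loss  # mean squared error measured before this update


# Made-up data: four "pixels", the condition map marks the first two.
x_t = [0.5, -1.0, 0.8, -0.3]
cond = [1.0, 1.0, 0.0, 0.0]
eps = frozen_base_eps(x_t)            # computed once, treated as fixed
target = [0.6, -0.7, eps[2], eps[3]]  # desired prediction under the condition

mcm = TinyMCM()
losses = [mcm.train_step(eps, cond, target) for _ in range(200)]
```

Starting both parameters at zero means the module initially leaves the pretrained prediction unchanged, which is one simple way to begin from the base model's behavior; whether the actual method does this is not stated in the abstract.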