Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. Despite this success, text descriptions often fail to convey detailed controls adequately, even when they are long and complex. Moreover, recent studies have shown that these models struggle to understand such complex texts and to generate the corresponding images. There is therefore a growing need for control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) to be used simultaneously, in a flexible and composable manner, within one model. Unlike existing methods, Uni-ControlNet only requires fine-tuning two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet needs only a constant number of adapters (i.e., 2), regardless of the number of local or global controls used. This not only reduces fine-tuning cost and model size, making the model more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at \url{https://github.com/ShihaoZhaoZSH/Uni-ControlNet}.
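
The abstract does not spell out how a single local adapter can serve any number of local controls. One plausible reading, sketched below, is that each local condition type is given a fixed channel slot, the maps are concatenated along the channel axis, and absent conditions are zero-filled, so the adapter always sees the same input shape. The slot layout `LOCAL_SLOTS`, the helper `pack_local_conditions`, and the zero-filling rule are illustrative assumptions, not details taken from the abstract or from the released code.

```python
from typing import Dict, List

# Channels x height x width, as nested lists to keep the sketch dependency-free.
Image = List[List[List[float]]]

# Hypothetical layout: one fixed channel slot per local condition type.
LOCAL_SLOTS: Dict[str, int] = {"edge": 1, "depth": 1, "segmentation": 3}


def zeros(channels: int, height: int, width: int) -> Image:
    """Return an all-zero map used as a placeholder for an absent condition."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def pack_local_conditions(conds: Dict[str, Image], height: int, width: int) -> Image:
    """Concatenate local condition maps along the channel axis.

    Conditions that are not supplied are zero-filled (an assumption), so the
    packed input always has the same number of channels and one adapter can
    consume any subset of the supported conditions.
    """
    unknown = set(conds) - set(LOCAL_SLOTS)
    if unknown:
        raise ValueError(f"unsupported local conditions: {sorted(unknown)}")

    packed: Image = []
    for name, channels in LOCAL_SLOTS.items():
        cond = conds.get(name)
        if cond is None:
            cond = zeros(channels, height, width)
        if len(cond) != channels:
            raise ValueError(f"{name} expects {channels} channel(s), got {len(cond)}")
        packed.extend(cond)
    return packed


# Usage: only a depth map is given; edge and segmentation slots are zero-filled.
depth_only = pack_local_conditions({"depth": [[[0.5, 0.5], [0.5, 0.5]]]}, height=2, width=2)
```

Under this assumption, adding or removing a condition at inference time changes only which slots are non-zero, not the adapter's input shape, which is one way the adapter count could stay constant as the number of controls varies.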
