To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for innovative approaches to manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that produces an initial ranking score for all input conditions by leveraging pre-trained conditional generation models and discriminative models. This controller evaluates both the similarity between extracted conditions and input conditions and the pixel-level similarity with the source image. Then, we integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator, which refines the ordering of conditions based on the double-cycle controller's score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing MLLMs' reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images. Through both quantitative and qualitative comparisons, DynamicControl demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability under various conditional controls.
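The abstract's condition-ranking step can be illustrated with a minimal sketch. This is not the paper's implementation: the `Condition` fields, weight parameters, and scoring function are assumptions standing in for the pre-trained generative and discriminative models; it only shows how a score combining condition-level and pixel-level similarity could produce the initial ordering that the MLLM-based evaluator would then refine.

```python
# Hypothetical sketch of the double-cycle controller's scoring step.
# The similarity values would, in the actual framework, come from comparing
# re-extracted conditions against input conditions and from pixel-level
# comparison with the source image; here they are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Condition:
    name: str               # e.g. "depth", "canny", "segmentation"
    cond_similarity: float  # similarity of re-extracted vs. input condition
    pix_similarity: float   # pixel-level similarity with the source image


def double_cycle_score(c: Condition,
                       w_cond: float = 0.5,
                       w_pix: float = 0.5) -> float:
    """Combine the two consistency measures into one ranking score.

    The equal weighting is an assumption for illustration only.
    """
    return w_cond * c.cond_similarity + w_pix * c.pix_similarity


def rank_conditions(conditions: list[Condition]) -> list[Condition]:
    """Sort conditions by descending double-cycle score; this initial
    ordering would then be refined by the MLLM condition evaluator."""
    return sorted(conditions, key=double_cycle_score, reverse=True)


if __name__ == "__main__":
    conds = [
        Condition("canny", 0.91, 0.72),
        Condition("depth", 0.64, 0.88),
        Condition("segmentation", 0.55, 0.60),
    ]
    for c in rank_conditions(conds):
        print(f"{c.name}: {double_cycle_score(c):.3f}")
```

In the full framework, the top-ranked subset of conditions would then be fed to the parallel multi-control adapter rather than printed.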