Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl designs structure guidance to facilitate structure alignment with a guidance image, and appearance guidance to enable appearance sharing between images generated from the same seed. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl facilitates convenient training-free control over many different architectures and checkpoints, handles challenging input conditions on which most existing training-free methods fail, and achieves synthesis quality competitive with training-based approaches.
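The abstract names structure guidance without saying how it is computed. The toy sketch below assumes (this is not stated in the abstract) that structure guidance is a gradient step on an energy that compares intermediate features of the generated image and of the guidance image after projecting both onto a small PCA basis. The helpers `pca_basis`, `structure_energy`, and `structure_guidance_step` are hypothetical, and the sketch edits plain feature arrays directly instead of backpropagating through a real diffusion U-Net to the noisy latent.

```python
import numpy as np


def pca_basis(features: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k principal directions (C x k) of row-wise features (N x C).

    Hypothetical helper: in this toy, the basis stands in for a set of
    "structure" directions estimated from features of seed images.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def structure_energy(feat: np.ndarray, guide_feat: np.ndarray, basis: np.ndarray) -> float:
    """Mean squared distance between the two feature sets after projection onto the basis."""
    diff = feat @ basis - guide_feat @ basis  # N x k
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def structure_guidance_step(
    feat: np.ndarray, guide_feat: np.ndarray, basis: np.ndarray, step_size: float
) -> np.ndarray:
    """One gradient-descent step on structure_energy with respect to feat."""
    n = feat.shape[0]
    diff = feat @ basis - guide_feat @ basis  # N x k
    grad = (2.0 / n) * diff @ basis.T  # analytic gradient, N x C
    return feat - step_size * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis = pca_basis(rng.normal(size=(256, 32)), k=4)
    feat = rng.normal(size=(64, 32))   # toy features of the image being generated
    guide = rng.normal(size=(64, 32))  # toy features of the guidance image
    print("energy before:", structure_energy(feat, guide, basis))
    for _ in range(20):
        feat = structure_guidance_step(feat, guide, basis, step_size=8.0)
    print("energy after:", structure_energy(feat, guide, basis))
```

Because the basis has orthonormal columns, each step changes only the component of the features that lies inside the projected subspace; the orthogonal component is left exactly as it was. In this toy, that is what keeps the structure term from overwriting everything else in the features.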