Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs from other modalities, such as edge maps. Nevertheless, current controllable T2I methods commonly face challenges in efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Compared with Uni-ControlNet, our approach reduces trainable parameters by 41% and memory usage by 30%. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
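The abstract does not spell out the weight decomposition itself, so the following is only a generic sketch of why sharing factors across conditions can cut trainable parameters. It assumes a plain low-rank factorization in which every condition k gets a weight W_k = U V_k, with U shared and V_k specific to the condition. The function names, the factorization, and the dimensions are all hypothetical and are not taken from FlexEControl; the numbers are not meant to reproduce the reported 41% figure.

```python
import numpy as np


def full_param_count(d_out, d_in, n_conditions):
    """Parameters when each condition has its own dense (d_out x d_in) weight."""
    return n_conditions * d_out * d_in


def shared_low_rank_param_count(d_out, d_in, rank, n_conditions):
    """Parameters for one shared U (d_out x rank) plus one V_k (rank x d_in) per condition."""
    return d_out * rank + n_conditions * rank * d_in


def build_weights(d_out, d_in, rank, n_conditions, seed=0):
    """Materialize W_k = U @ V_k for every condition (illustrative random factors)."""
    rng = np.random.default_rng(seed)
    shared_u = rng.standard_normal((d_out, rank))                 # shared across conditions
    per_condition_v = rng.standard_normal((n_conditions, rank, d_in))  # one per condition
    # matmul broadcasts the shared factor over the leading condition axis,
    # giving an array of shape (n_conditions, d_out, d_in).
    return shared_u @ per_condition_v


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    d_out, d_in, rank, n_conditions = 8, 6, 2, 3
    weights = build_weights(d_out, d_in, rank, n_conditions)
    dense = full_param_count(d_out, d_in, n_conditions)
    factored = shared_low_rank_param_count(d_out, d_in, rank, n_conditions)
    print(weights.shape, dense, factored)
```

In this toy setting the shared factor is paid for once, so the parameter count grows with the number of conditions only through the small per-condition factors. The trade-off is that each W_k is restricted to rank at most `rank`.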