Text-to-Image (T2I) diffusion models have achieved remarkable performance in generating high-quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values such as eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module that enables fine-grained, multi-attribute control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages a decoupled cross-attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce a Conditional Variational Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Moreover, our method enables a broader control range and improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and easily scales to multiple attributes within a single model.
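To make the decoupled cross-attention idea concrete, the sketch below shows one plausible way an attribute branch could sit alongside the frozen text branch of a pretrained U-Net cross-attention layer: continuous attribute values are mapped to a few conditioning tokens, attended to separately, and added to the text-conditioned output. This is only a minimal illustration under assumed names and shapes (AttrTokenizer, DecoupledCrossAttention, attr_scale are all hypothetical), it omits the CVAE component, and it is not the authors' implementation.

```python
# Minimal sketch (PyTorch) of a decoupled cross-attention layer with an extra
# attribute branch. All class/parameter names here are illustrative assumptions.
import torch
import torch.nn as nn


class AttrTokenizer(nn.Module):
    """Maps a vector of continuous attribute values (e.g., eye openness,
    car width) to a small set of conditioning tokens."""

    def __init__(self, num_attrs: int, token_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(num_attrs, token_dim * num_tokens),
            nn.GELU(),
            nn.Linear(token_dim * num_tokens, token_dim * num_tokens),
        )

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (B, num_attrs) -> (B, num_tokens, token_dim)
        b = attrs.shape[0]
        return self.proj(attrs).view(b, self.num_tokens, -1)


class DecoupledCrossAttention(nn.Module):
    """Cross-attention with separate key/value projections for text tokens
    (frozen, pretrained in practice) and attribute tokens (new, trainable),
    whose outputs are summed rather than concatenated."""

    def __init__(self, dim: int, cond_dim: int, attr_scale: float = 1.0):
        super().__init__()
        self.scale = dim ** -0.5
        self.attr_scale = attr_scale
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Text branch (would reuse the pretrained weights).
        self.to_k_text = nn.Linear(cond_dim, dim, bias=False)
        self.to_v_text = nn.Linear(cond_dim, dim, bias=False)
        # Attribute branch (the adapter part).
        self.to_k_attr = nn.Linear(cond_dim, dim, bias=False)
        self.to_v_attr = nn.Linear(cond_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def attend(self, q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, x, text_tokens, attr_tokens):
        # x: (B, N, dim) latent tokens; text/attr tokens: (B, T, cond_dim)
        q = self.to_q(x)
        out_text = self.attend(q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
        out_attr = self.attend(q, self.to_k_attr(attr_tokens), self.to_v_attr(attr_tokens))
        return self.to_out(out_text + self.attr_scale * out_attr)


if __name__ == "__main__":
    # Hypothetical shapes: two continuous attributes in [0, 1], CLIP-like 768-d
    # conditioning, 320-d latent tokens.
    attrs = torch.tensor([[0.8, 0.2]])
    attr_tokens = AttrTokenizer(num_attrs=2, token_dim=768)(attrs)   # (1, 4, 768)
    text_tokens = torch.randn(1, 77, 768)
    latents = torch.randn(1, 64, 320)
    layer = DecoupledCrossAttention(dim=320, cond_dim=768)
    out = layer(latents, text_tokens, attr_tokens)                    # (1, 64, 320)
```

Because the attribute branch only adds key/value projections on top of the frozen text branch, varying the input attribute vector at inference time would, under this sketch, steer the generation continuously while the text prompt continues to be honored.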