Diffusion-based text-to-image models hold great potential for transferring the style of a reference image. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce DEADiff to address this issue using the following two strategies: 1) A mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers instructed by different text descriptions, and are then injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained on paired images rather than on an identical target: the reference image and the ground-truth image share either the same style or the same semantics. We show, both quantitatively and qualitatively, that DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image. Our project page is https://tianhao-qi.github.io/DEADiff/.
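The two strategies can be illustrated with a minimal, framework-free sketch. It is not the DEADiff implementation: the layer count, the particular split of layers, and the names `route_reference_feature`, `sample_pair`, and `DATASET` are all hypothetical, since the abstract does not state which cross-attention layers receive which feature or how pairs are constructed. The sketch only shows the two stated properties: the style and semantic features go to disjoint layer subsets, and the reference image is never the training target itself.

```python
import random
from dataclasses import dataclass

# Assumption: 16 cross-attention layers, split in half. The abstract only says
# the two subsets are mutually exclusive, not which layers belong to which.
STYLE_LAYERS = frozenset(range(0, 8))
SEMANTIC_LAYERS = frozenset(range(8, 16))
assert STYLE_LAYERS.isdisjoint(SEMANTIC_LAYERS)


def route_reference_feature(layer_idx, style_feat, semantic_feat):
    """Return the reference feature injected at one cross-attention layer."""
    if layer_idx in STYLE_LAYERS:
        return style_feat
    if layer_idx in SEMANTIC_LAYERS:
        return semantic_feat
    return None  # any other layer sees only the text condition


@dataclass(frozen=True)
class Image:
    name: str
    style: str
    subject: str


# Toy dataset (hypothetical): every image has a same-style and a same-subject partner.
DATASET = [
    Image("a", "watercolor", "cat"),
    Image("b", "watercolor", "dog"),
    Image("c", "oil", "cat"),
    Image("d", "oil", "dog"),
]


def sample_pair(dataset, shared_attr, rng):
    """Pick a (reference, target) pair that agrees on `shared_attr`
    ("style" or "subject") but consists of two different images,
    so the target is never a reconstruction of the reference."""
    reference = rng.choice(dataset)
    candidates = [
        img for img in dataset
        if img is not reference
        and getattr(img, shared_attr) == getattr(reference, shared_attr)
    ]
    if not candidates:
        raise ValueError(f"no partner sharing {shared_attr!r} for {reference.name}")
    return reference, rng.choice(candidates)
```

Calling `sample_pair(DATASET, "style", random.Random(0))` yields two distinct images in the same style, which would supervise the style branch; passing `"subject"` does the same for the semantic branch. Keeping the two layer sets disjoint means no single layer ever receives both reference features, which is the property the abstract credits for better disentanglement.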