DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

The diffusion-based text-to-image model harbors immense potential in transferring reference style. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce \textit{DEADiff} to address this issue using the following two strategies: 1) a mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers which are instructed by different text descriptions. Then they are injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than the identical target, in which the reference image and the ground-truth image are with the same style or semantics. We show that DEADiff attains the best visual stylization results and optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively. Our project page is~\href{https://tianhao-qi.github.io/DEADiff/}{https://tianhao-qi.github.io/DEADiff/}.

翻译：基于扩散的文本到图像模型在迁移参考风格方面具有巨大潜力。然而，当前基于编码器的方法在迁移风格的同时显著削弱了文本到图像模型的文本可控性。本文提出\textit{DEADiff}，通过以下两种策略解决该问题：1）一种解耦参考图像风格与语义的机制。首先由不同文本描述引导的Q-Former提取解耦的特征表示，随后将这些表示注入交叉注意力层互斥的子集以实现更优的解耦。2）一种非重构学习方法。Q-Former通过成对图像（而非同一目标）进行训练，其中参考图像与真实图像具有相同风格或语义。我们通过定性和定量实验证明，DEADiff在保持文本到图像模型固有文本可控性与参考图像风格相似性之间达到了最佳平衡，并取得了最优的视觉风格化结果。项目页面为~\href{https://tianhao-qi.github.io/DEADiff/}{https://tianhao-qi.github.io/DEADiff/}。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日