Content and style (C-S) disentanglement is a fundamental problem and a critical challenge in style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and unsatisfying results. In this paper, we propose a new C-S disentanglement framework for style transfer that does not rely on previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss, coordinated with a style reconstruction prior, is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves results superior to the state of the art, along with flexible C-S disentanglement and trade-off control. Our work provides new insights into C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics.
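The idea of treating style as the complement of explicitly extracted content in an embedding space can be illustrated with a small sketch. This is not the paper's formulation: the short vectors stand in for CLIP image embeddings, and the function names, the subtraction-based definition of a style vector, and the cosine form of the loss are all assumptions made for illustration.

```python
import numpy as np

def unit(v, eps=1e-12):
    """Normalize a vector to unit length (eps guards against a zero vector)."""
    return v / (np.linalg.norm(v) + eps)

def style_direction(e_image, e_content):
    """Hypothetical style vector: what is left of an image embedding
    after subtracting the embedding of its content-only version."""
    return np.asarray(e_image, dtype=float) - np.asarray(e_content, dtype=float)

def direction_loss(e_style, e_style_content, e_out, e_out_content):
    """Assumed loss: 1 - cosine similarity between the reference style
    vector and the style vector of the generated output."""
    d_ref = style_direction(e_style, e_style_content)
    d_out = style_direction(e_out, e_out_content)
    return 1.0 - float(np.dot(unit(d_ref), unit(d_out)))

# Toy 3-d stand-ins for embeddings (real CLIP embeddings are much larger).
e_style, e_style_content = [1.0, 2.0, 0.0], [1.0, 0.0, 0.0]

# Output whose residual points the same way as the reference style.
loss_match = direction_loss(e_style, e_style_content,
                            [3.0, 1.0, 5.0], [3.0, 0.0, 5.0])

# Output whose residual is orthogonal to the reference style.
loss_mismatch = direction_loss(e_style, e_style_content,
                               [3.0, 0.0, 6.0], [3.0, 0.0, 5.0])

print(loss_match, loss_mismatch)
```

In this toy setup the loss is near 0 when the output's style residual is aligned with the reference and near 1 when it is orthogonal, which is the kind of signal a disentanglement objective could use to steer style independently of content.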