Diffusion models (DMs) synthesize high-quality images in various domains. However, controlling their generative process remains poorly understood because the intermediate variables in the process have not been rigorously studied. Recently, the bottleneck of the U-Net, named $h$-space, was found to support StyleCLIP-like editing of DMs. In this paper, we discover that DMs inherently have disentangled representations for the content and style of the resulting images: $h$-space contains the content and the skip connections convey the style. Furthermore, we introduce a principled way to inject the content of one image into another that accounts for the progressive nature of the generative process. Briefly, given the original generative process, 1) the feature of the source content should be gradually blended, 2) the blended feature should be normalized to preserve the distribution, and 3) the change in the skip connections due to content injection should be calibrated. The resulting image then has the source content with the style of the original image, just as in image-to-image translation. Interestingly, injecting content into styles of unseen domains produces harmonization-like style transfer. To the best of our knowledge, our method is the first training-free, feed-forward style transfer that uses only an unconditional, pretrained, frozen generative network. The code is available at https://curryjung.github.io/DiffStyle/.
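
As a rough illustration of steps 1) and 2) above, the following minimal sketch blends two bottleneck features with a ratio that grows over the sampling steps and then rescales the result to the norm of the original feature. It is not taken from the released code: the function names `blend_ratio` and `inject_content`, the linear schedule, the `max_ratio` value, and the choice of norm matching as the normalization are all assumptions made for illustration, and features are represented as flat lists of floats. Step 3), the calibration of the skip connections, depends on the network architecture and is not sketched here.

```python
import math


def blend_ratio(step, num_steps, max_ratio=0.6):
    """Hypothetical linear schedule: 0 at the first step, max_ratio at the last."""
    if num_steps < 2:
        return max_ratio
    return max_ratio * step / (num_steps - 1)


def inject_content(h_original, h_content, ratio):
    """Blend two h-space features, then rescale to the original feature's norm.

    Assumption: "normalized to preserve the distribution" is approximated
    here by matching the Euclidean norm of the original feature.
    """
    blended = [(1.0 - ratio) * o + ratio * c
               for o, c in zip(h_original, h_content)]
    target_norm = math.sqrt(sum(o * o for o in h_original))
    blended_norm = math.sqrt(sum(b * b for b in blended))
    if blended_norm == 0.0:
        # Nothing to rescale; return the (all-zero) blend unchanged.
        return blended
    scale = target_norm / blended_norm
    return [b * scale for b in blended]
```

In a sampling loop, one would call `inject_content(h_original, h_content, blend_ratio(t, T))` at each step `t` of `T`, so that the source content enters gradually while the feature magnitude stays at that of the original process.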