Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing extraordinary image generation from natural-language text prompts. However, the limited controllability of such models restricts their practical applicability to real-life content creation, and attention has therefore turned to leveraging a reference image to control text-to-image synthesis. Because the generated image correlates closely with the reference image, this problem can also be regarded as manipulating (or editing) the reference image according to the text, namely text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts a pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization. To guide T2I generation with a reference image, we propose to model diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer that dynamically substitutes a certain DCT frequency band of the diffusion features with the corresponding counterpart of the reference image along the reverse sampling process. We demonstrate that our method flexibly enables highly controllable text-driven I2I translation in both the guiding factor and the guiding intensity of the reference image, simply by tuning the type and the bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify the superiority of our approach over related methods in visual quality, versatility, and controllability of I2I translation.
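To make the core mechanism concrete, the following is a minimal NumPy/SciPy sketch of the frequency band substitution idea, not the paper's released implementation: the feature shapes, the bandwidth parameter `tau`, and the diagonal band mask `u + v < tau` are illustrative assumptions. It replaces the low-frequency DCT band of the generated diffusion features with the corresponding band from the reference image's features at the same sampling step.

```python
# Illustrative sketch of a DCT frequency band substitution step.
# Assumptions (not from the paper's code): feature maps of shape (C, H, W),
# a scalar bandwidth `tau`, and a diagonal low-frequency mask u + v < tau.
import numpy as np
from scipy.fft import dctn, idctn

def frequency_band_substitution(gen_feat: np.ndarray,
                                ref_feat: np.ndarray,
                                tau: int) -> np.ndarray:
    """Substitute the low-frequency DCT band of `gen_feat` with the
    corresponding band of `ref_feat`; larger `tau` widens the band and
    thus strengthens guidance from the reference image."""
    # Type-II DCT over the spatial axes; `norm="ortho"` makes idctn exact.
    gen_dct = dctn(gen_feat, axes=(-2, -1), norm="ortho")
    ref_dct = dctn(ref_feat, axes=(-2, -1), norm="ortho")

    h, w = gen_feat.shape[-2:]
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    low_band = (u + v) < tau  # boolean mask over the (H, W) spectrum

    # Take the masked band from the reference spectrum, keep the rest of
    # the generated spectrum untouched, then invert back to feature space.
    mixed = np.where(low_band, ref_dct, gen_dct)
    return idctn(mixed, axes=(-2, -1), norm="ortho")
```

Under this sketch, tuning `tau` would modulate the guiding intensity of the reference image, while substituting a different band (e.g., the high-frequency complement of the mask) would steer a different guiding factor, mirroring the type and bandwidth controls described above.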