Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function and require more diffusion steps, which makes generation time-consuming and limits their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion that controls structure and appearance without additional training or guidance. Ctrl-X introduces feed-forward structure control, which aligns the output with a structure image, and semantic-aware appearance transfer, which transfers appearance from a user-input image. Extensive qualitative and quantitative experiments demonstrate the superior performance of Ctrl-X across various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality for any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x
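The abstract does not spell out how semantic-aware appearance transfer works, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that appearance can be transferred by matching per-token feature statistics (an AdaIN-style normalization), where the statistics are averaged over appearance-image tokens with attention weights so that each output token borrows from semantically similar regions. The function names, tensor shapes, and the use of NumPy in place of a real diffusion model's feature maps are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def normalize(feats, eps=1e-6):
    """Standardize each channel over the token axis (axis 0)."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)


def appearance_transfer(out_feats, app_feats):
    """Hypothetical attention-weighted statistics transfer.

    out_feats: (N, C) features of the image being generated.
    app_feats: (M, C) features of the appearance image.
    Returns (N, C) features whose per-token mean and std come from
    attention-weighted averages over the appearance tokens.
    """
    q = normalize(out_feats)
    k = normalize(app_feats)
    # Each output token attends to semantically similar appearance tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, M)
    # Attention-weighted first and second moments of appearance features.
    mean = attn @ app_feats                                  # (N, C)
    second_moment = attn @ (app_feats ** 2)                  # (N, C)
    std = np.sqrt(np.maximum(second_moment - mean ** 2, 0.0))
    # Re-color the normalized output features with the borrowed statistics.
    return std * q + mean


rng = np.random.default_rng(0)
out_feats = rng.normal(size=(16, 8))   # 16 output tokens, 8 channels
app_feats = rng.normal(size=(32, 8))   # 32 appearance tokens, 8 channels
transferred = appearance_transfer(out_feats, app_feats)
print(transferred.shape)
```

In this sketch the output keeps its own spatial layout (through the normalized features `q`) while its channel statistics come from the appearance features, which loosely mirrors the separation of structure and appearance described above. In a real pipeline such an operation would presumably be applied to intermediate diffusion features at each denoising step rather than to random arrays.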