We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches struggle with (a) extracting style from reference images when no additional style or content text descriptions are available, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller in which a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes these difficulties, but also ensures high fidelity to the reference style and adherence to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. Supported by theoretical justification and empirical evidence, our framework demonstrates precise extraction and control of content and style in a training-free manner. Further, our method allows seamless composition of content and style without depending on external adapters or ControlNets.
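To make the phrase "stochastic optimal controller with a terminal cost" concrete, the following is a minimal sketch of a generic controlled-diffusion problem of that kind. It is an illustration under stated assumptions, not necessarily the exact formulation used by RB-Modulation: the symbols $f$, $g$, $u$, $\Psi$, $z_{\mathrm{ref}}$, and $\gamma$ are hypothetical names introduced here, and time is taken to run along the generation process so that $X_T$ is the final generated sample.

```latex
% Hypothetical notation (not taken from the source):
%   f, g     : drift and diffusion coefficient of the sampling dynamics
%   u        : control added to the drift
%   \Psi     : style descriptor mapping an image to style features
%   z_ref    : reference style image
%   \gamma   : weight of the terminal (style) cost
\begin{align}
  \mathrm{d}X_t &= \bigl[ f(X_t, t) + u(X_t, t) \bigr]\,\mathrm{d}t
                   + g(t)\,\mathrm{d}W_t, \qquad X_0 \sim p_0, \\
  \min_{u}\;& \mathbb{E}\!\left[ \int_0^T \tfrac{1}{2}\,
              \bigl\| u(X_t, t) \bigr\|^2 \,\mathrm{d}t
              \;+\; \gamma\, \bigl\| \Psi(z_{\mathrm{ref}}) - \Psi(X_T) \bigr\|^2 \right].
\end{align}
```

In this sketch the running cost penalizes large controls, which keeps the sampler close to the unmodified, text-conditioned dynamics, while the terminal cost pulls the style features of the final sample toward those of the reference image. The optimal control then acts as a modified drift, which is the sense in which a style descriptor can steer generation through a terminal cost without any training.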