We propose PARASOL, a multi-modal synthesis model that enables disentangled, parametric control of an image's visual style by jointly conditioning synthesis on content and a fine-grained visual style embedding. We train a latent diffusion model (LDM) with modality-specific losses and adapt classifier-free guidance to encourage disentangled control over the independent content and style modalities at inference time. We use auxiliary semantic and style-based search to build training triplets for supervising the LDM, ensuring that content and style cues are complementary. PARASOL shows promise for nuanced control over visual style in diffusion models for image creation and stylization, and for generative search, where text-based search results may be adapted to more closely match user intent by interpolating both content and style descriptors.
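To make the inference-time idea concrete, the sketch below shows one common way to extend classifier-free guidance to two conditions, together with linear interpolation between descriptors. It is an illustration under stated assumptions, not necessarily the formulation PARASOL uses: the nested combination rule, the function names `guided_noise` and `lerp`, and the scales `g_content` and `g_style` are all hypothetical, and plain Python lists stand in for the model's noise predictions and embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def guided_noise(
    eps_uncond: Vector,
    eps_content: Vector,
    eps_full: Vector,
    g_content: float,
    g_style: float,
) -> List[float]:
    """Combine three noise predictions using separate guidance scales.

    eps_uncond  : prediction with both conditions dropped
    eps_content : prediction conditioned on content only
    eps_full    : prediction conditioned on content and style
    The nested rule below is an assumed formulation for illustration.
    """
    if not (len(eps_uncond) == len(eps_content) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + g_content * (c - u) + g_style * (f - c)
        for u, c, f in zip(eps_uncond, eps_content, eps_full)
    ]


def lerp(a: Vector, b: Vector, t: float) -> List[float]:
    """Linearly interpolate between two descriptors (t=0 gives a, t=1 gives b)."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Toy two-dimensional "predictions"; real ones would be latent tensors.
    eps = guided_noise([0.0, 0.0], [1.0, 2.0], [3.0, 2.0], g_content=1.0, g_style=1.0)
    print(eps)
    # Blend two hypothetical style descriptors halfway.
    print(lerp([0.0, 2.0], [4.0, 6.0], 0.5))
```

In this sketch, setting `g_style` to zero reduces the result to content-only guidance, and setting both scales to one recovers the fully conditioned prediction, which is what allows the two modalities to be weighted independently.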