This paper presents \textbf{a novel style guided diffusion model (SGDiff)} that overcomes certain weaknesses inherent in existing image synthesis models. The proposed SGDiff combines the image modality with a pretrained text-to-image diffusion model to facilitate creative fashion image synthesis. It addresses the limitations of text-to-image diffusion models by incorporating supplementary style guidance, substantially reducing training costs, and overcoming the difficulty of controlling synthesized styles with text-only inputs. This paper also introduces a new dataset, SG-Fashion, designed specifically for fashion image synthesis, which offers high-resolution images and an extensive range of garment categories. Through a comprehensive ablation study, we examine the application of classifier-free guidance to a variety of conditions and validate the effectiveness of the proposed model in generating fashion images of the desired categories, product attributes, and styles. The contributions of this paper include a novel classifier-free guidance method for multi-modal feature fusion, a comprehensive dataset for fashion image synthesis, a thorough investigation of conditioned text-to-image synthesis, and insights for future research in the text-to-image synthesis domain. The code and dataset are available at: \url{https://github.com/taited/SGDiff}.
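For readers unfamiliar with classifier-free guidance, the following minimal sketch shows the widely used single-condition form, in which an unconditional and a conditional noise prediction are combined with a guidance scale. It is an illustration only, not the SGDiff method: the abstract does not specify how guidance is applied across multiple conditions, and the function name `classifier_free_guidance` and the toy values are hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions using the standard classifier-free
    guidance rule: eps = eps_uncond + s * (eps_cond - eps_uncond).

    Illustrative helper (hypothetical name); inputs are flat lists of floats
    standing in for the noise tensors a diffusion model would predict.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for model outputs (assumed values, not real predictions).
eps_uncond = [0.0, 1.0]   # prediction without conditioning
eps_cond = [1.0, 3.0]     # prediction with conditioning (e.g. text or style)

# A scale above 1 pushes the result further toward the conditional prediction.
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

With a scale of 1 the rule returns the conditional prediction unchanged, and with a scale of 0 it returns the unconditional one, so the scale controls how strongly the condition steers the sample.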