With the open-sourcing of text-to-image (T2I) models such as Stable Diffusion (SD) and Stable Diffusion XL (SD-XL), there has been an influx of models fine-tuned from the open-source SD model for specific domains such as anime and character portraits. However, specialized models remain scarce in certain domains, such as interior design. This scarcity is attributed to the complex textual descriptions and detailed visual elements inherent in design, along with the need for adaptable resolution. Text-to-image models for interior design therefore require outstanding prompt-following capabilities and must support iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training the open-source CLIP model on it in both English and Chinese. We also propose a fine-tuning strategy that combines curriculum learning with reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach and thereby improve the quality of image generation. Experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which outperforms strong baselines.