With the open-sourcing of text-to-image (T2I) models such as Stable Diffusion (SD) and Stable Diffusion XL (SD-XL), there has been an influx of models fine-tuned from the open-source SD model for specific domains, such as anime and character portraits. However, specialized models remain scarce in certain domains such as interior design. This scarcity is attributed to the complex textual descriptions and detailed visual elements inherent in design, along with the need for adaptable resolution. Text-to-image models for interior design therefore require outstanding prompt-following capabilities and must support iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training the open-source CLIP model in both English and Chinese. We also propose a fine-tuning strategy that combines curriculum learning with reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach and thereby improve the quality of image generation. Experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which outperforms strong baselines.
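The abstract does not specify how CLIP feedback is turned into a reward or how the curriculum is ordered. As a minimal sketch under stated assumptions, the snippet below treats the reward as the cosine similarity between an image embedding and a prompt embedding (both assumed to be precomputed by a CLIP-style encoder that is not included here), and uses prompt word count as a stand-in difficulty measure for the curriculum. The function names `clip_feedback_reward` and `curriculum_order` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def clip_feedback_reward(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Cosine similarity between one image embedding and one text embedding.

    Assumption: both vectors were produced by a CLIP-style image encoder and
    text encoder. A higher value is read as better prompt following.
    """
    dot = sum(x * y for x, y in zip(image_emb, text_emb))
    norm_image = math.sqrt(sum(x * x for x in image_emb))
    norm_text = math.sqrt(sum(y * y for y in text_emb))
    return dot / (norm_image * norm_text)


def curriculum_order(prompts: List[str]) -> List[str]:
    """Sort prompts from fewest to most words.

    Assumption: word count is only a placeholder for difficulty; the paper's
    actual curriculum criterion is not given in the abstract.
    """
    return sorted(prompts, key=lambda p: len(p.split()))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, chosen by hand for illustration only.
    aligned = clip_feedback_reward([1.0, 0.0], [2.0, 0.0])
    unrelated = clip_feedback_reward([1.0, 0.0], [0.0, 3.0])
    print(aligned, unrelated)

    prompts = [
        "a modern living room with oak floors",
        "bedroom",
        "small kitchen",
    ]
    print(curriculum_order(prompts))
```

In a reinforcement-learning fine-tuning loop, a scalar like this would typically be computed per generated image and used to weight the model update, while the ordered prompts would be fed to training in stages from simpler to more complex descriptions.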