In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we focus especially on the choice of model and training dataset sizes, aspects that have not previously been systematically investigated for text-to-image cascaded diffusion models. In particular, we comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, both of which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient scenario for training diffusion models. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models.