We present a comprehensive solution for learning from human preference feedback and using it to improve text-to-image models. First, we build ImageReward, the first general-purpose text-to-image human preference reward model, to effectively encode human preferences. It is trained on data from our systematic annotation pipeline, which combines rating and ranking and has collected 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. Building on ImageReward, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm that optimizes diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over the compared methods. All code and datasets are available at \url{https://github.com/THUDM/ImageReward}.