Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with their text prompts. We propose a three-stage fine-tuning method for aligning such models using human feedback. First, we collect human feedback assessing how well model outputs align with a diverse set of text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, we fine-tune the text-to-image model by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts, and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigation of these choices is important for balancing the tradeoff between alignment and fidelity. Our results demonstrate the potential of learning from human feedback to significantly improve text-to-image models.
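As an illustration of the third stage only, maximizing reward-weighted likelihood can be read as minimizing a negative log-likelihood in which each generated image is weighted by its predicted reward. The sketch below is a minimal toy version under that reading, not the exact objective used in this work: the function name `reward_weighted_nll`, the per-image log-likelihood values, and the reward scores in [0, 1] are all hypothetical, and any additional regularization is omitted.

```python
from typing import Sequence


def reward_weighted_nll(log_likelihoods: Sequence[float],
                        rewards: Sequence[float]) -> float:
    """Mean negative log-likelihood, each sample weighted by its reward.

    Hypothetical helper for illustration: minimizing this value is
    equivalent to maximizing the reward-weighted likelihood of the batch.
    """
    if len(log_likelihoods) != len(rewards):
        raise ValueError("need exactly one reward per sample")
    if not log_likelihoods:
        raise ValueError("need at least one sample")
    # Samples with a reward of zero contribute nothing to the loss,
    # so the model is only pushed toward well-aligned images.
    total = sum(-r * lp for lp, r in zip(log_likelihoods, rewards))
    return total / len(log_likelihoods)


# Toy batch (assumed values): log p(image | prompt) under the current model,
# and reward-function scores for the same image-prompt pairs.
log_p = [-2.0, -1.0, -4.0]
reward = [1.0, 0.5, 0.0]
loss = reward_weighted_nll(log_p, reward)
```

In a real training loop the log-likelihood term would come from the text-to-image model's own training loss and the rewards from the learned reward function; here both are plain numbers so the weighting itself is easy to inspect.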