The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app, we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. We then test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. We therefore recommend using PickScore to evaluate future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.
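The abstract does not spell out how the score is computed or how ranking is applied, so the following is only a minimal sketch under stated assumptions. It assumes a CLIP-style score, namely a scaled cosine similarity between a prompt embedding and an image embedding, a softmax over scores as the predicted preference distribution, and best-of-n selection as the ranking step. The function names, the `scale` value, and the toy embedding vectors are all hypothetical; a real setup would obtain the embeddings from trained text and image encoders.

```python
import math
from typing import List, Sequence


def cosine_score(text_emb: Sequence[float],
                 image_emb: Sequence[float],
                 scale: float = 100.0) -> float:
    """Scaled cosine similarity between two embeddings (assumed score form)."""
    dot = sum(t * i for t, i in zip(text_emb, image_emb))
    norm_t = math.sqrt(sum(t * t for t in text_emb))
    norm_i = math.sqrt(sum(i * i for i in image_emb))
    if norm_t == 0.0 or norm_i == 0.0:
        raise ValueError("embeddings must be non-zero")
    return scale * dot / (norm_t * norm_i)


def preference_probs(scores: Sequence[float]) -> List[float]:
    """Softmax over scores, read as a predicted preference distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_images(text_emb: Sequence[float],
                image_embs: Sequence[Sequence[float]],
                scale: float = 100.0) -> List[int]:
    """Return image indices sorted from highest to lowest score."""
    scores = [cosine_score(text_emb, emb, scale) for emb in image_embs]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for encoder outputs (hypothetical data).
    prompt = [1.0, 0.0]
    candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    order = rank_images(prompt, candidates)
    print("ranking:", order)
    print("selected image index:", order[0])
```

Under these assumptions, enhancing a text-to-image model via ranking amounts to generating several candidate images for one prompt and keeping the candidate at `order[0]`.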