Recent text-to-image generative models can generate high-fidelity images from text inputs, but existing evaluation metrics cannot accurately assess the quality of these images. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 430,060 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that more accurately predicts human preferences on text-generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair, and easy to use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from academia, the community, and industry. The code and dataset are available at https://github.com/tgxs002/HPSv2.
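To make the idea of a preference scoring model concrete, the following is a minimal sketch of how a CLIP-style scorer could turn text and image embeddings into a preference prediction and a training loss. The abstract does not specify the training objective, so the softmax-over-similarities formulation, the temperature value, and all function names (`cosine`, `preference_probs`, `preference_loss`) are assumptions made for illustration, not the authors' implementation. The embeddings are toy vectors standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_probs(text_emb, image_embs, temperature=0.07):
    """Softmax over text-image similarities (assumed formulation).

    Each probability is the predicted chance that a human prefers
    the corresponding image for the given prompt.
    """
    logits = [cosine(text_emb, e) / temperature for e in image_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def preference_loss(text_emb, image_embs, chosen_index, temperature=0.07):
    """Cross-entropy between predicted preferences and the human choice."""
    probs = preference_probs(text_emb, image_embs, temperature)
    return -math.log(probs[chosen_index])


# Toy embeddings (hypothetical): image A is closer to the prompt than image B.
text = [1.0, 0.0]
img_a = [0.9, 0.1]
img_b = [0.1, 0.9]

probs = preference_probs(text, [img_a, img_b])
loss_if_a_chosen = preference_loss(text, [img_a, img_b], chosen_index=0)
loss_if_b_chosen = preference_loss(text, [img_a, img_b], chosen_index=1)
```

Under this sketch, fine-tuning would adjust the encoders so that the loss decreases on annotated pairs; the loss is small when the annotator chose the image the model already ranks higher, and large otherwise.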