Evaluation of text-to-image models typically relies on statistical metrics that inadequately represent real human preferences. Although recent work attempts to learn these preferences from human-annotated images, it reduces the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for evaluating text-to-image models. MPS introduces a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of recent text-to-image models. MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation.