Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive task requires deep comprehension of both text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or report poorly described human evaluations that are neither reliable nor repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future work. In our pilot data collection, we show experimentally that current automatic measures are incompatible with human perception when evaluating text-to-image generation results. Furthermore, we provide insights for designing reliable and conclusive human evaluation experiments. Finally, we make several resources publicly available to the community to facilitate easy and fast implementation.