F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Generative AI models exhibit remarkable capabilities in content creation, particularly in face image generation, customization, and restoration. However, current AI-generated faces (AIGFs) often fall short of human preferences because of unique distortions, unrealistic details, and unexpected identity shifts, which underscores the need for a comprehensive quality evaluation framework for AIGFs. To address this need, we introduce FaceQ, a large-scale, comprehensive database of AI-generated Face images with fine-grained Quality annotations that reflect human preferences. The FaceQ database comprises 12,255 images generated by 29 models across three tasks: (1) face generation, (2) face customization, and (3) face restoration. It includes 32,742 mean opinion scores (MOSs) from 180 annotators, collected across multiple dimensions: quality, authenticity, identity (ID) fidelity, and text-image correspondence. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, which highlights their strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA), face quality assessment (FQA), AI-generated content image quality assessment (AIGCIQA), and preference evaluation metrics, and show that these standard metrics are relatively ineffective at evaluating authenticity, ID fidelity, and text-image correspondence. The FaceQ database will be publicly available upon publication.
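As an illustration of how an automatic metric might be checked against human preferences, the following minimal sketch averages per-annotator ratings into a MOS per image and then computes the Spearman rank correlation between the MOSs and a metric's scores. The abstract does not state which agreement measure F-Bench uses, so the choice of Spearman correlation is an assumption here, and the image names, ratings, and metric scores are hypothetical toy values, not FaceQ data.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average the per-annotator ratings given to one image."""
    return mean(ratings)


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from FaceQ): three annotator ratings per image
# for a single dimension, plus the scores of some automatic metric.
ratings = {
    "img_a": [4, 5, 4],
    "img_b": [2, 3, 2],
    "img_c": [3, 3, 4],
    "img_d": [1, 2, 1],
}
metric_scores = {"img_a": 0.91, "img_b": 0.40, "img_c": 0.75, "img_d": 0.55}

names = sorted(ratings)
mos = [mean_opinion_score(ratings[n]) for n in names]
pred = [metric_scores[n] for n in names]
srcc = spearman(pred, mos)
print(f"SRCC between metric and MOS: {srcc:.3f}")
```

A correlation near 1 would indicate that the metric ranks images much as the annotators do; a low value on a dimension such as authenticity or ID fidelity would correspond to the kind of ineffectiveness the abstract reports.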
