Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images for both training data augmentation and model performance assessment. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation. Using a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We apply our framework to faces generated by state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation, including limited faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models.