Recent years have witnessed the rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.