Recent years have witnessed the rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics, inspired by their supervised fairness counterparts, to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results.