Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, even when the data come from a biased generator. QDGS is model-agnostic: it uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. As a proof of concept, we first use balanced synthetic datasets generated by QDGS to debias classifiers trained on color-biased shape datasets. We then apply QDGS to facial data synthesis, prompting for desired semantic concepts, such as skin tone and age, to create an intersectional dataset that combines these visual features. Training classifiers on this balanced data improves fairness while maintaining accuracy on facial recognition benchmarks. Code is available at: https://github.com/Cylumn/qd-generative-sampling
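To make the idea of sampling uniformly across a measure space from a biased generator more concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not name the optimizer, so a simple MAP-Elites-style archive loop is used here as a stand-in. All names (`generate`, `bin_index`, `naive_sample`, `qd_sample`) are hypothetical, the "generator" is a hand-written function of a 2-D latent rather than an image model, and the single scalar measure stands in for a prompt-derived attribute score such as skin tone or age.

```python
import math
import random


def generate(z):
    """Toy stand-in for a generator plus its scoring functions (assumption).

    Returns (measure, quality) for a 2-D latent z. The measure lies in [0, 1];
    the quality is highest (zero) when z[1] == 0.
    """
    measure = 0.5 * (1.0 + math.tanh(z[0] / 2.0))  # logistic sigmoid of z[0]
    quality = -abs(z[1])
    return measure, quality


def bin_index(measure, n_bins):
    """Map a measure in [0, 1] to one of n_bins equal-width cells."""
    return min(int(measure * n_bins), n_bins - 1)


def naive_sample(n_samples, n_bins, rng):
    """Sample latents from the prior and count how many land in each cell."""
    counts = [0] * n_bins
    for _ in range(n_samples):
        z = (rng.gauss(0, 1), rng.gauss(0, 1))
        measure, _ = generate(z)
        counts[bin_index(measure, n_bins)] += 1
    return counts


def qd_sample(n_iters, n_bins, rng, sigma=0.5):
    """MAP-Elites-style loop: keep the best-quality latent found per cell."""
    archive = {}  # cell index -> (quality, latent)
    for _ in range(n_iters):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite to explore neighbouring cells.
            _, parent = rng.choice(list(archive.values()))
            z = (parent[0] + rng.gauss(0, sigma), parent[1] + rng.gauss(0, sigma))
        else:
            # Occasionally draw a fresh latent from the prior.
            z = (rng.gauss(0, 1), rng.gauss(0, 1))
        measure, quality = generate(z)
        cell = bin_index(measure, n_bins)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, z)
    return archive


if __name__ == "__main__":
    rng = random.Random(0)
    print("naive counts per cell:", naive_sample(1000, 10, rng))
    archive = qd_sample(5000, 10, rng)
    print("cells filled by QD loop:", len(archive), "of 10")
```

In this toy setting, naive sampling concentrates in the middle cells because the latent prior is Gaussian, while the archive keeps one high-quality sample per cell, so the resulting set is spread evenly over the measure. A real pipeline would replace `generate` with a generative model and learned quality and measure scores.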