Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space even when the data come from a biased generator. QDGS is model-agnostic: it uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. As a proof of concept, we first use balanced synthetic datasets generated by QDGS to debias classifiers trained on color-biased shape datasets. We then apply QDGS to facial data synthesis, prompting for desired semantic concepts such as skin tone and age to create an intersectional dataset with a blend of visual features. Training classifiers on this balanced data improves fairness while maintaining accuracy on facial recognition benchmarks. Code is available at: https://github.com/Cylumn/qd-generative-sampling.
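The idea of sampling uniformly across a measure space from a biased generator can be illustrated with a toy sketch. This is not the QDGS implementation: QDGS uses prompt guidance to optimize a quality objective, whereas the sketch below simply draws repeatedly from a generator and keeps the best sample per measure bin, in the style of a MAP-Elites archive. The generator `biased_generator`, its Beta-distributed measure, the random quality score, and the bin count are all hypothetical stand-ins chosen for illustration.

```python
import random


def biased_generator(rng):
    """Hypothetical biased generator (an assumption, not from the paper).

    Returns (measure, quality). The measure lies in [0, 1] and is skewed
    toward 0, mimicking a generator that over-represents one attribute value.
    """
    measure = rng.betavariate(1.0, 2.0)  # density is highest near 0
    quality = rng.random()               # stand-in for a quality objective
    return measure, quality


def bin_index(measure, n_bins):
    """Map a measure in [0, 1] to one of n_bins equal-width bins."""
    return min(int(measure * n_bins), n_bins - 1)


def naive_bin_counts(n_draws, n_bins, rng):
    """Count how many raw generator samples fall into each measure bin."""
    counts = [0] * n_bins
    for _ in range(n_draws):
        measure, _ = biased_generator(rng)
        counts[bin_index(measure, n_bins)] += 1
    return counts


def archive_sample(n_draws, n_bins, rng):
    """Keep only the highest-quality sample seen in each measure bin.

    The result holds at most one sample per bin, so the kept samples are
    spread evenly over the measure space even though the draws are not.
    """
    archive = {}
    for _ in range(n_draws):
        measure, quality = biased_generator(rng)
        b = bin_index(measure, n_bins)
        if b not in archive or quality > archive[b][1]:
            archive[b] = (measure, quality)
    return archive
```

Calling `naive_bin_counts(5000, 10, random.Random(0))` shows most raw draws landing in the low-measure bins, while `archive_sample(5000, 10, random.Random(0))` returns one retained sample per bin. The trade-off in this toy version is sample efficiency: rare bins are filled only by chance, which is the kind of cost that guided optimization aims to avoid.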