We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting undue biases in the observed distribution and thereby redefines SDG as learning a distribution other than that of the real data. By contrast, the absence of disparate impact is achieved, in particular, when the synthetic and real distributions coincide. We identify reasons why SDG may fail to reach that solution, and discuss why approximation and estimation errors occur and can differ across groups. In particular, we examine the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both overall utility and utility parity in many settings.
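The effect of group proportions and privacy noise on estimation error can be illustrated with a minimal simulation. This sketch is not the paper's experimental setup and does not use a probabilistic graphical model: it assumes a single categorical attribute with a hypothetical distribution `p` shared by both groups, a Laplace mechanism applied to each group's histogram (sensitivity 1 under add/remove-one-record neighbouring datasets), and total variation distance as a stand-in for utility. The function name `group_tv_errors` and all parameter values are illustrative assumptions.

```python
import numpy as np


def group_tv_errors(n_total=10_000, minority_share=0.05, epsilon=1.0,
                    n_trials=200, seed=0):
    """Average total variation error of a noisy histogram estimate, per group.

    Hypothetical setup: both groups share the same true distribution `p`,
    so any gap in error comes only from group size and privacy noise.
    """
    rng = np.random.default_rng(seed)
    p = np.array([0.4, 0.3, 0.2, 0.1])  # assumed true attribute distribution
    n_minority = round(n_total * minority_share)
    sizes = {"majority": n_total - n_minority, "minority": n_minority}

    errors = {}
    for group, n in sizes.items():
        tv = []
        for _ in range(n_trials):
            # Sampling error: a finite sample of size n from p.
            counts = rng.multinomial(n, p)
            # Privacy error: Laplace noise with scale 1/epsilon on each count
            # (assumes histogram sensitivity 1, add/remove neighbouring).
            noisy = counts + rng.laplace(scale=1.0 / epsilon, size=p.size)
            noisy = np.clip(noisy, 0.0, None)
            estimate = noisy / noisy.sum()
            tv.append(0.5 * np.abs(estimate - p).sum())
        errors[group] = float(np.mean(tv))
    return errors


if __name__ == "__main__":
    for group, err in group_tv_errors().items():
        print(f"{group}: mean TV error = {err:.4f}")
```

Under these assumptions the smaller group receives the same absolute noise on far fewer records and also has a larger sampling error, so its estimated distribution tends to be further from the truth even though the true distribution is identical across groups.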