When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data while maintaining the privacy of the original data. Multiple works have therefore been devoted to devising systems for DP synthetic data generation. However, such systems may preserve or even magnify properties of the data that make it unfair, rendering the synthetic data unfit for use. In this work, we present PreFair, a system for generating fair synthetic data under DP. PreFair extends state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures the synthetic data is fair. We adapt the notion of justifiable fairness to fit the synthetic data generation scenario. We further study the problem of generating DP fair synthetic data, showing its intractability and designing algorithms that are optimal under certain assumptions. We also provide an extensive experimental evaluation showing that PreFair generates synthetic data that is significantly fairer than the data generated by leading DP data generation mechanisms, while remaining faithful to the private data.