Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model that produces synthetic data resembling the original distribution addresses some of these issues, most applications impose additional constraints on the generated data. Existing synthetic data approaches are limited: they typically handle only specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization of the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and then fine-tunes it on a differentiable loss automatically derived from the provided specifications. These specifications can be declared programmatically using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state of the art on some of them while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state of the art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work.
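The pre-train-then-fine-tune idea can be illustrated with a deliberately tiny sketch. This is not ProgSyn's implementation or interface: the "generative model" here is only a softmax over the four cells of a hypothetical two-column table (sensitive attribute `s`, label `y`), the specification is assumed to be a demographic parity gap, and the counts, penalty weight, and learning rate are made up for illustration. Gradients are approximated by finite differences to keep the example dependency-free; a real system would rely on automatic differentiation.

```python
import math

# Hypothetical toy table: counts of (s, y) pairs (illustrative values only).
COUNTS = {(0, 0): 30, (0, 1): 20, (1, 0): 15, (1, 1): 35}
CELLS = sorted(COUNTS)  # fixed cell order: (0,0), (0,1), (1,0), (1,1)


def softmax(logits):
    """Map unconstrained logits to a probability vector over the cells."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parity_gap(p):
    """Assumed specification: P(y=1 | s=1) - P(y=1 | s=0) under joint p."""
    return p[3] / (p[2] + p[3]) - p[1] / (p[0] + p[1])


def loss(logits, target, weight):
    """Cross-entropy to the original data plus a weighted squared gap."""
    p = softmax(logits)
    fidelity = -sum(q * math.log(p_i) for q, p_i in zip(target, p))
    return fidelity + weight * parity_gap(p) ** 2


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences; a stand-in for automatic differentiation."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def fit(logits, target, weight, steps=2000, lr=0.1):
    """Plain gradient descent on the combined loss."""
    for _ in range(steps):
        g = numeric_grad(lambda z: loss(z, target, weight), logits)
        logits = [z - lr * g_i for z, g_i in zip(logits, g)]
    return logits


total = sum(COUNTS.values())
empirical = [COUNTS[c] / total for c in CELLS]

# Stage 1: "pre-train" on data fidelity alone (weight 0 disables the penalty).
pretrained = fit([0.0] * 4, empirical, weight=0.0)
# Stage 2: "fine-tune" from the pre-trained parameters with the penalty on.
finetuned = fit(pretrained, empirical, weight=10.0)

gap_before = abs(parity_gap(softmax(pretrained)))
gap_after = abs(parity_gap(softmax(finetuned)))
print(f"parity gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In this toy setting the penalty weight plays the role of a trade-off knob: a larger weight pushes the gap closer to zero at the cost of moving the model further from the original distribution.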