Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model that produces synthetic data resembling the original distribution addresses some of these issues, most applications impose additional constraints on the generated data. Existing synthetic data approaches are limited, as they typically handle only specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization of the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications. These specifications can be declared programmatically using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state of the art on some of them while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state of the art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work.
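To make the idea of fine-tuning on a differentiable loss derived from a specification more concrete, the following is a minimal sketch and not ProgSyn's actual implementation or interface. It assumes a toy "generative model" that is just a categorical distribution over four cells of a hypothetical sensitive attribute A and label Y, a hypothetical original distribution `p_orig`, and a demographic parity specification turned into a squared penalty. All function names, the penalty weight, and the learning rate are illustrative assumptions; gradients are taken by finite differences to keep the example dependency-free, where a real system would use automatic differentiation.

```python
import math

def softmax(logits):
    """Map unconstrained logits to a probability vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Cells of the toy joint distribution over (A, Y): (0,0), (0,1), (1,0), (1,1).
def parity_gap(p):
    """Demographic parity gap P(Y=1 | A=0) - P(Y=1 | A=1)."""
    return p[1] / (p[0] + p[1]) - p[3] / (p[2] + p[3])

def loss(logits, p_orig, weight):
    """Cross-entropy to the original data plus a squared parity penalty."""
    p = softmax(logits)
    fidelity = -sum(q * math.log(pi) for q, pi in zip(p_orig, p))
    return fidelity + weight * parity_gap(p) ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences (stand-in for autodiff in this sketch)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def fine_tune(logits, p_orig, weight, steps=2000, lr=0.1):
    """Plain gradient descent on the combined loss, starting from given logits."""
    logits = list(logits)
    for _ in range(steps):
        g = numeric_grad(lambda z: loss(z, p_orig, weight), logits)
        logits = [z - lr * gi for z, gi in zip(logits, g)]
    return logits

# Hypothetical original joint distribution with a visible parity gap.
p_orig = [0.30, 0.20, 0.40, 0.10]

# "Pre-training": for this toy model the maximum-likelihood fit is exact.
pretrained = [math.log(q) for q in p_orig]

# Fine-tuning on the specification-derived loss.
tuned = fine_tune(pretrained, p_orig, weight=10.0)

gap_before = parity_gap(softmax(pretrained))
gap_after = parity_gap(softmax(tuned))
print(f"parity gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In this sketch the penalty weight controls the trade-off between fidelity to the original distribution and satisfaction of the specification: a larger weight shrinks the parity gap further at the cost of a larger deviation from `p_orig`.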