Privacy, data quality, and data sharing concerns pose a key limitation for tabular data applications. While generating synthetic data resembling the original distribution addresses some of these issues, most applications would benefit from additional customization of the generated data. However, existing synthetic data approaches are each limited to a particular kind of constraint, e.g., differential privacy (DP) or fairness. In this work, we introduce CuTS, the first customizable synthetic tabular data generation framework. Customization in CuTS is achieved via declarative statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). To ensure high synthetic data quality in the presence of custom specifications, CuTS is pre-trained on the original dataset and then fine-tuned on a differentiable loss that is automatically derived from the provided specifications using novel relaxations. We evaluate CuTS on four datasets and numerous custom specifications, where it outperforms state-of-the-art specialized approaches on several tasks while being more general. In particular, at the same fairness level, we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset.
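To make the idea of fine-tuning on a loss derived from a specification more concrete, the following is a minimal sketch, not the CuTS specification language or its actual relaxations. It assumes a hypothetical 2x2 joint distribution over a binary protected group and a binary outcome, and expresses a demographic-parity requirement as a penalty that is a smooth function of the probability masses (differentiable almost everywhere), added to a simple fidelity term. All names and numbers are illustrative assumptions.

```python
def demographic_parity_gap(joint):
    """Absolute difference in positive-outcome rates between two groups.

    joint[g][y] is the probability mass of group g (0 or 1) with outcome
    y (0 or 1). The gap depends only on these masses, not on sampled rows,
    so it could be minimized with gradient-based training.
    """
    rates = []
    for g in (0, 1):
        group_mass = joint[g][0] + joint[g][1]
        rates.append(joint[g][1] / group_mass)
    return abs(rates[0] - rates[1])


def total_loss(joint, reference, weight):
    """Fidelity term (squared distance to the reference) plus weighted penalty."""
    fidelity = sum(
        (joint[g][y] - reference[g][y]) ** 2 for g in (0, 1) for y in (0, 1)
    )
    return fidelity + weight * demographic_parity_gap(joint)


# Hypothetical joint distribution of (group, outcome) from the original data.
reference = [[0.30, 0.20], [0.40, 0.10]]
# A candidate synthetic distribution with equal positive rates in both groups.
fair = [[0.35, 0.15], [0.35, 0.15]]

gap_reference = demographic_parity_gap(reference)
gap_fair = demographic_parity_gap(fair)
loss_reference = total_loss(reference, reference, weight=1.0)
loss_fair = total_loss(fair, reference, weight=1.0)
```

In this toy setting the candidate distribution gives up a small amount of fidelity to remove the parity gap entirely, so its total loss is lower than that of the unmodified reference. The weight controls that trade-off; in an automatic-differentiation framework the same expression could serve directly as a fine-tuning objective.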