Privacy, data quality, and data sharing concerns are key limitations for tabular data applications. While generating synthetic data that resembles the original distribution addresses some of these issues, most applications would benefit from additional customization of the generated data. However, existing synthetic data approaches are each limited to particular constraints, e.g., differential privacy (DP) or fairness. In this work, we introduce CuTS, the first customizable synthetic tabular data generation framework. Customization in CuTS is achieved via declarative statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). To ensure high synthetic data quality in the presence of custom specifications, CuTS is pre-trained on the original dataset and then fine-tuned on a differentiable loss that is automatically derived from the provided specifications using novel relaxations. We evaluate CuTS on four datasets and numerous custom specifications, where it outperforms state-of-the-art specialized approaches on several tasks while being more general. In particular, at the same fairness level, we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset.
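The abstract says that specifications are turned into a differentiable loss but does not give the relaxations themselves. The sketch below is therefore only an illustration of the general idea, not the CuTS implementation. It assumes a generator that exposes a joint probability table over two categorical columns, so that a statistical expression (a demographic parity gap) and a logical expression (a forbidden value combination) can both be written as smooth functions of those probabilities rather than as counts over sampled rows. All names (`demographic_parity_gap`, `violation_mass`, `spec_loss`, the column values, and the weights) are hypothetical.

```python
from typing import Dict, Set, Tuple

# A joint marginal over two categorical columns: (value_a, value_b) -> probability.
Joint = Dict[Tuple[str, str], float]


def demographic_parity_gap(joint: Joint, groups: Tuple[str, str], positive: str) -> float:
    """Return |P(positive | group 0) - P(positive | group 1)| from a joint marginal.

    Keys of `joint` are (group, outcome) pairs. The result is a smooth function
    of the table entries except where the two conditionals are equal.
    """
    def conditional(group: str) -> float:
        group_mass = sum(p for (g, _), p in joint.items() if g == group)
        return joint.get((group, positive), 0.0) / group_mass

    return abs(conditional(groups[0]) - conditional(groups[1]))


def violation_mass(joint: Joint, forbidden: Set[Tuple[str, str]]) -> float:
    """Probability mass on forbidden cells.

    This is a soft stand-in for counting sampled rows that break a logical rule.
    """
    return sum(p for cell, p in joint.items() if cell in forbidden)


def spec_loss(fair_joint: Joint, logic_joint: Joint, forbidden: Set[Tuple[str, str]],
              fairness_weight: float = 1.0, logic_weight: float = 1.0) -> float:
    """Weighted sum of the two penalties (the weights are assumed hyperparameters)."""
    gap = demographic_parity_gap(fair_joint, ("female", "male"), "high")
    return fairness_weight * gap + logic_weight * violation_mass(logic_joint, forbidden)


# Hypothetical (sex, income) marginal produced by a generator.
fair_joint = {
    ("female", "high"): 0.1, ("female", "low"): 0.4,
    ("male", "high"): 0.2, ("male", "low"): 0.3,
}
# Hypothetical (age_group, marital_status) marginal with one disallowed combination.
logic_joint = {
    ("minor", "married"): 0.05, ("minor", "single"): 0.15,
    ("adult", "married"): 0.45, ("adult", "single"): 0.35,
}
forbidden = {("minor", "married")}

loss = spec_loss(fair_joint, logic_joint, forbidden)
```

In a fine-tuning setup the same expressions would be written in an automatic differentiation framework so that gradients flow back into the generator's parameters. Plain Python is used here only to make the arithmetic explicit.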