How Realistic Is Your Synthetic Data? Constraining Deep Generative Models for Tabular Data

Deep Generative Models (DGMs) have been shown to be powerful tools for generating tabular data, as they have been increasingly able to capture the complex distributions that characterize them. However, to generate realistic synthetic data, it is often not enough to have a good approximation of their distribution, as it also requires compliance with constraints that encode essential background knowledge on the problem at hand. In this paper, we address this limitation and show how DGMs for tabular data can be transformed into Constrained Deep Generative Models (C-DGMs), whose generated samples are guaranteed to be compliant with the given constraints. This is achieved by automatically parsing the constraints and transforming them into a Constraint Layer (CL) seamlessly integrated with the DGM. Our extensive experimental analysis with various DGMs and tasks reveals that standard DGMs often violate constraints, some exceeding $95\%$ non-compliance, while their corresponding C-DGMs are never non-compliant. Then, we quantitatively demonstrate that, at training time, C-DGMs are able to exploit the background knowledge expressed by the constraints to outperform their standard counterparts with up to $6.5\%$ improvement in utility and detection. Further, we show how our CL does not necessarily need to be integrated at training time, as it can be also used as a guardrail at inference time, still producing some improvements in the overall performance of the models. Finally, we show that our CL does not hinder the sample generation time of the models.

翻译：深度生成模型已被证明是生成表格数据的强大工具，因为它们日益能够捕捉表征表格数据的复杂分布。然而，要生成逼真的合成数据，仅靠对分布的良好近似往往不够，还需要满足编码问题领域关键背景知识的约束条件。本文针对这一局限性展开研究，展示了如何将面向表格数据的深度生成模型转化为约束深度生成模型，使其生成的样本保证符合给定约束。这通过自动解析约束条件并将其转化为与深度生成模型无缝集成的约束层来实现。我们使用多种深度生成模型在不同任务上进行了广泛的实验分析，结果表明标准深度生成模型经常违反约束条件，部分模型的不合规率超过95%，而对应的约束深度生成模型从未出现不合规情况。此外，我们定量证明，在训练阶段，约束深度生成模型能够利用约束条件所表达的背景知识，在效用性和检测能力上较标准模型提升高达6.5%。进一步，我们展示了约束层不必在训练阶段集成，也可作为推理阶段的防护栏使用，仍能在模型整体性能上产生一定改善。最后，我们证明约束层不会影响模型的样本生成时间。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日