Deep Generative Models (DGMs) have proven to be powerful tools for generating tabular data, as they are increasingly able to capture the complex distributions that characterize such data. However, generating realistic synthetic data often requires more than a good approximation of the data distribution: the samples must also comply with constraints that encode essential background knowledge about the problem at hand. In this paper, we address this limitation and show how DGMs for tabular data can be transformed into Constrained Deep Generative Models (C-DGMs), whose generated samples are guaranteed to comply with the given constraints. This is achieved by automatically parsing the constraints and transforming them into a Constraint Layer (CL) that is seamlessly integrated with the DGM. Our extensive experimental analysis with various DGMs and tasks reveals that standard DGMs often violate constraints, with some exceeding $95\%$ non-compliance, while their corresponding C-DGMs never violate them. We then quantitatively demonstrate that, at training time, C-DGMs are able to exploit the background knowledge expressed by the constraints to outperform their standard counterparts, with up to $6.5\%$ improvement in utility and detection. Further, we show that our CL does not need to be integrated at training time: it can also be used as a guardrail at inference time, where it still yields some improvement in the overall performance of the models. Finally, we show that our CL does not hinder the sample generation time of the models.
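To make the idea of a constraint layer used as an inference-time guardrail more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical constraint format (each constraint gives a lower bound on one column as a linear combination of other columns) and assumes the constraints are listed in an order where each bound depends only on columns that are never modified or have already been repaired. The names `CONSTRAINTS`, `constraint_layer`, and `violation_rate` are illustrative only.

```python
import numpy as np

# Hypothetical constraint format (an assumption, not taken from the paper):
# column `target` must be >= bias + sum(coef * column) over `terms`.
CONSTRAINTS = [
    {"target": 1, "terms": {0: 1.0}, "bias": 0.0},          # x1 >= x0
    {"target": 2, "terms": {0: 1.0, 1: 1.0}, "bias": 0.0},  # x2 >= x0 + x1
]


def lower_bound(samples, constraint):
    """Evaluate the right-hand side of one constraint for every row."""
    bound = np.full(samples.shape[0], constraint["bias"], dtype=float)
    for col, coef in constraint["terms"].items():
        bound += coef * samples[:, col]
    return bound


def constraint_layer(samples, constraints):
    """Repair samples so that every constraint holds.

    Assumes the constraints are ordered so that each bound only uses
    columns that are untouched or were repaired by an earlier constraint.
    """
    out = np.array(samples, dtype=float, copy=True)
    for c in constraints:
        # Raise the target column to its lower bound where it falls short.
        out[:, c["target"]] = np.maximum(out[:, c["target"]], lower_bound(out, c))
    return out


def violation_rate(samples, constraints, tol=1e-9):
    """Fraction of rows that violate at least one constraint."""
    samples = np.asarray(samples, dtype=float)
    violated = np.zeros(samples.shape[0], dtype=bool)
    for c in constraints:
        violated |= samples[:, c["target"]] < lower_bound(samples, c) - tol
    return float(violated.mean())


# Stand-in for unconstrained generator output.
raw = np.array([[1.0, 0.5, 3.0],
                [2.0, 3.0, 1.0]])
fixed = constraint_layer(raw, CONSTRAINTS)
```

In this sketch, rows that already satisfy a constraint pass through unchanged, and only the offending column is moved to the nearest feasible value. Applied as a post-processing step, it corresponds loosely to the guardrail use described in the abstract; integrating such a step at training time would require a differentiable framework rather than NumPy.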