Synthetic tabular data generation has traditionally been a challenging problem due to the high complexity of the underlying distributions that characterise this type of data. Despite recent advances in deep generative models (DGMs), existing methods often fail to produce realistic datapoints that are well-aligned with available background knowledge. In this paper, we address this limitation by introducing the Disjunctive Refinement Layer (DRL), a novel layer designed to enforce the alignment of generated data with the background knowledge specified in user-defined constraints. DRL is the first method able to automatically make deep learning models inherently compliant with constraints as expressive as quantifier-free linear formulas, which can define non-convex and even disconnected spaces. Our experimental analysis shows that DRL not only guarantees constraint satisfaction but also improves performance on downstream tasks. Notably, when applied to DGMs that frequently violate constraints, DRL eliminates violations entirely. Further, it improves performance metrics by up to 21.4% in F1-score and 20.9% in Area Under the ROC Curve, thus demonstrating its practical impact on data generation.
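To make the core idea concrete, the following is a minimal, hypothetical sketch of a refinement layer that repairs generator outputs so they land inside a disjunction of feasible regions. It is not the paper's DRL algorithm: for simplicity it assumes each disjunct is an axis-aligned box (a special case of quantifier-free linear constraints whose union can still be non-convex and disconnected), so the projection onto each disjunct reduces to clamping. The class name `DisjunctiveBoxRefinement` and all parameters are illustrative inventions.

```python
import torch

class DisjunctiveBoxRefinement(torch.nn.Module):
    """Illustrative sketch (not the paper's DRL): project each generated
    sample onto the nearest of k axis-aligned boxes, so every output
    satisfies at least one disjunct of the constraint."""

    def __init__(self, lows, highs):
        # lows, highs: (k, d) tensors defining k boxes in d dimensions,
        # interpreted as the disjunction  OR_i (lows[i] <= x <= highs[i]).
        super().__init__()
        self.register_buffer("lows", lows)
        self.register_buffer("highs", highs)

    def forward(self, x):
        # x: (n, d) batch of generated samples.
        # Exact Euclidean projection onto each box is just clamping ...
        proj = torch.clamp(x.unsqueeze(1), self.lows, self.highs)  # (n, k, d)
        # ... then, per sample, keep the projection onto the closest box,
        # so disconnected feasible regions are handled by disjunct choice.
        dist = ((proj - x.unsqueeze(1)) ** 2).sum(-1)              # (n, k)
        best = dist.argmin(dim=1)                                  # (n,)
        return proj[torch.arange(x.size(0)), best]

# Usage: two disconnected feasible intervals on a single feature.
layer = DisjunctiveBoxRefinement(
    lows=torch.tensor([[0.0], [5.0]]),
    highs=torch.tensor([[1.0], [6.0]]),
)
print(layer(torch.tensor([[-0.3], [0.4], [3.2], [7.1]])))
# Each sample is moved into [0, 1] or [5, 6], whichever is nearer.
```

Because the repair step is differentiable almost everywhere, such a layer can in principle sit on top of a DGM during training as well as at sampling time; handling general quantifier-free linear formulas, as DRL does, requires a more involved treatment of arbitrary halfspace disjuncts than this box-only sketch.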