Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by its foreign-key relationships. We use flow matching to learn a deep generative model of the content of the whole relational database, in which the neural network trained to denoise records uses a graph neural network to obtain information from connected records. Our method is flexible, as it supports relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.
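To make the idea of flow matching over a foreign-key graph more concrete, the following is a minimal sketch and not the paper's architecture. It assumes, purely for illustration, that every record has already been encoded as a fixed-length numeric vector, that foreign keys are given as (child, parent) index pairs, and that the denoiser is a toy linear velocity field with a single round of mean aggregation over neighbors. The names `neighbor_mean`, `velocity`, `flow_matching_loss`, and `sample`, as well as the three scalar weights, are hypothetical.

```python
import random


def neighbor_mean(features, edges):
    """Average the feature vectors of each record's foreign-key neighbors."""
    n, d = len(features), len(features[0])
    sums = [[0.0] * d for _ in range(n)]
    degree = [0] * n
    for child, parent in edges:
        # Treat each foreign key as an undirected link between two records.
        for a, b in ((child, parent), (parent, child)):
            degree[a] += 1
            for k in range(d):
                sums[a][k] += features[b][k]
    return [[s / max(deg, 1) for s in row] for row, deg in zip(sums, degree)]


def velocity(x_t, t, edges, w_self, w_nbr, w_time):
    """Toy velocity field (assumption): linear in record, neighbor mean, time."""
    agg = neighbor_mean(x_t, edges)
    return [[w_self * xi + w_nbr * ai + w_time * t for xi, ai in zip(x, a)]
            for x, a in zip(x_t, agg)]


def flow_matching_loss(x1, edges, params, rng):
    """One Monte Carlo sample of a conditional flow matching loss.

    Uses the linear path x_t = (1 - t) * x0 + t * x1 with target x1 - x0.
    """
    t = rng.random()
    x0 = [[rng.gauss(0.0, 1.0) for _ in row] for row in x1]  # noise records
    x_t = [[(1 - t) * a + t * b for a, b in zip(r0, r1)]
           for r0, r1 in zip(x0, x1)]
    target = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]
    pred = velocity(x_t, t, edges, *params)
    sq = [(p - q) ** 2 for rp, rq in zip(pred, target) for p, q in zip(rp, rq)]
    return sum(sq) / len(sq)


def sample(n, d, edges, params, rng, steps=10):
    """Generate record contents for a fixed graph by Euler integration."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    for i in range(steps):
        v = velocity(x, i / steps, edges, *params)
        x = [[xi + vi / steps for xi, vi in zip(rx, rv)]
             for rx, rv in zip(x, v)]
    return x


# Hypothetical toy database: records 0-1 are parents, records 2-4 are children.
records = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
foreign_keys = [(2, 0), (3, 0), (4, 1)]
params = (0.1, 0.2, 0.0)  # untrained, illustrative weights
print(flow_matching_loss(records, foreign_keys, params, random.Random(0)))
print(sample(len(records), 2, foreign_keys, params, random.Random(0)))
```

In this sketch the graph is held fixed and only the record contents are noised and denoised, mirroring the abstract's setting of generating content given the foreign-key graph. A single aggregation round only reaches direct neighbors; stacking several rounds, as a graph neural network does, would let information travel further within a connected component.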