Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key respects: scalability to larger datasets and the capture of long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce $\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This approach uses clustering labels as intermediaries to model relationships between tables, focusing specifically on foreign key constraints. ClavaDDPM draws on the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables, which enables it to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods on these long-range dependencies while remaining competitive on utility metrics for single-table data.