In recent years, denoising diffusion generative models have come to be regarded as state-of-the-art methods for synthetic data generation, especially for images. These approaches have also proved successful in other applications, such as tabular and graph data generation. However, because of their computational complexity, the application of these techniques to graph data has to date been restricted to small graphs, such as those used in molecular modeling. In this paper, we propose SaGess, a discrete denoising diffusion approach that is able to generate large real-world networks by augmenting a diffusion model (DiGress) with a generalized divide-and-conquer framework. The algorithm scales to larger graphs by sampling a covering of subgraphs of the initial graph and training DiGress on these subgraphs. SaGess then constructs a synthetic graph from the subgraphs generated by DiGress. We evaluate the quality of the synthetic data sets against several competitor methods in two ways: by comparing graph statistics between the original and synthetic samples, and by assessing the utility of the synthetic data for training a task-driven model, namely link prediction. In our experiments, SaGess outperforms most of the one-shot state-of-the-art graph generating methods by a significant margin, both on the graph metrics and on the link prediction task.
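The divide-and-conquer pipeline described in the abstract (sample a covering of subgraphs, train a generator on them, assemble a synthetic graph from the generated subgraphs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names `sample_covering` and `assemble` are hypothetical, random walks are only one possible way to sample subgraphs, the trained DiGress model is replaced by an identity stand-in, and assembling by taking the union of edges up to a target edge count is one plausible choice that the abstract does not specify. The input is assumed to be a simple undirected graph with sortable node labels and no self-loops.

```python
import random


def sample_covering(edges, subgraph_size, seed=0):
    """Sample small induced subgraphs until every edge of the graph is covered.

    Assumption: random-walk sampling; the abstract does not fix the sampler.
    Requires subgraph_size >= 2 and a graph without self-loops.
    """
    rng = random.Random(seed)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    uncovered = {frozenset(e) for e in edges}
    subgraphs = []
    while uncovered:
        # Seed each walk with an uncovered edge so every round makes progress.
        u, v = tuple(next(iter(uncovered)))
        nodes = {u, v}
        current = v
        for _ in range(10 * subgraph_size):  # step cap so the walk always stops
            if len(nodes) >= subgraph_size:
                break
            current = rng.choice(sorted(adj[current]))
            nodes.add(current)
        # Keep every edge of the original graph between the visited nodes.
        sub_edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
        uncovered -= sub_edges
        subgraphs.append(sub_edges)
    return subgraphs


def assemble(generated_subgraphs, target_num_edges):
    """Union the edges of generated subgraphs until the target size is reached.

    Assumption: generated subgraphs carry node labels from the original graph.
    """
    synthetic = set()
    for sub in generated_subgraphs:
        for edge in sub:
            if len(synthetic) >= target_num_edges:
                return synthetic
            synthetic.add(edge)
    return synthetic


# Toy graph: a 10-node ring with two chords.
edges = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
covering = sample_covering(edges, subgraph_size=4)

# Stand-in for the generative step: a real pipeline would train a diffusion
# model on `covering` and sample new subgraphs from it instead.
generated = covering

synthetic = assemble(generated, target_num_edges=len(edges))
```

Because the stand-in generator simply returns the training subgraphs, the assembled graph here reproduces the original edge set; with a trained model in its place, the assembled edges would instead come from newly sampled subgraphs.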