Tabular data synthesis is a long-standing research topic in machine learning. Many methods have been proposed over the past decades, ranging from statistical methods to deep generative models. However, these methods have not always been successful, owing to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy, based on the paradigm of score-based generative modeling. Although score-based generative models have resolved many issues in generative modeling, there is still room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increase sampling quality and diversity by stabilizing denoising score matching training. Furthermore, we conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity.