Diffusion models have become a leading paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language modeling, and speech synthesis. In this paper, we leverage diffusion models to generate synthetic tabular data. The heterogeneous features of tabular data have been a main obstacle in tabular data synthesis, and we tackle this problem by employing an auto-encoder architecture. Compared with state-of-the-art tabular synthesizers, the synthetic tables produced by our model show strong statistical fidelity to the real data and perform well on downstream machine learning utility tasks. We conducted experiments on $15$ publicly available datasets. Notably, our model adeptly captures the correlations among features, which has been a long-standing challenge in tabular data synthesis. Our code is available at https://github.com/UCLA-Trustworthy-AI-Lab/AutoDiffusion.