The ability to train generative models that produce realistic, safe, and useful tabular data is essential for data privacy, imputation, oversampling, explainability, and simulation. However, generating tabular data is challenging due to its heterogeneity, non-smooth distributions, complex dependencies, and imbalanced categorical features. Although diverse methods have been proposed in the literature, a unified evaluation under the same conditions on a variety of datasets is still needed. This study addresses that need by fully optimizing hyperparameters, feature encodings, and architectures. We investigate the impact of dataset-specific tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets, which vary in size (80,000 rows on average), data types, and domain. We also propose a reduced search space for each model that enables quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance over the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data; however, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget.