We investigate the impact of dataset-specific hyperparameter, feature-encoding, and architecture tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets. This study addresses the practical need for a unified evaluation of models that fully accounts for hyperparameter optimization. Additionally, we propose a reduced search space for each model that allows for quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance over the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data. However, this advantage is not significant when the entire tuning and training process for all models is restricted to the same GPU budget.