The ability to train generative models that produce realistic, safe, and useful tabular data is essential for data privacy, imputation, oversampling, explainability, and simulation. However, generating tabular data is challenging due to its heterogeneity, non-smooth distributions, complex dependencies, and imbalanced categorical features. Although diverse methods have been proposed in the literature, a unified evaluation under the same conditions on a variety of datasets is still needed. This study addresses that need by fully optimizing hyperparameters, feature encodings, and architectures. We investigate the impact of dataset-specific tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets, which vary in size (80,000 rows on average), data types, and domain. We also propose a reduced search space for each model that enables quick optimization, achieving nearly equivalent performance at a significantly lower cost. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance over the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data; however, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget.