Generative modeling for tabular data has recently gained significant attention in the deep learning community. Its objective is to estimate the underlying distribution of the data, a task that poses unique challenges for the tabular modality: such data is composed of mixed feature types, making it non-trivial for a model to learn the relationships among them. One approach to handling this mixture is to embed each feature into a continuous matrix via tokenization, while the transformer architecture offers a way to capture inter-feature relationships. In this work, we empirically investigate the potential of embedding representations for tabular data generation, using tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders (VAEs). Specifically, we compare four architectural approaches: a baseline VAE, two variants built on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares the models on density estimation and machine learning efficiency metrics. The main takeaway is that leveraging embedding representations through tensor contraction layers improves density estimation metrics while maintaining competitive performance in terms of machine learning efficiency.
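The two building blocks named above can be illustrated with a minimal toy sketch (not the paper's actual architecture; all shapes, feature counts, and weight initializations below are hypothetical): each mixed-type feature is "tokenized" into a continuous embedding, and a tensor contraction layer then contracts the resulting token matrix into a hidden vector, as it might inside a VAE encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 numerical features and 1 categorical
# feature with 4 categories, each tokenized into d = 8 dimensions.
d = 8
num_weights = rng.normal(size=(3, d))  # per-feature scale vectors
num_biases = rng.normal(size=(3, d))   # per-feature bias vectors
cat_table = rng.normal(size=(4, d))    # embedding lookup table

def tokenize(x_num, x_cat):
    """Map one mixed-type row to a continuous (n_features, d) token matrix."""
    num_tokens = x_num[:, None] * num_weights + num_biases  # (3, d)
    cat_tokens = cat_table[x_cat][None, :]                  # (1, d)
    return np.concatenate([num_tokens, cat_tokens], axis=0)  # (4, d)

# Tensor contraction layer: contract the token matrix T (n, d) with a
# weight tensor W (n, d, h) over both token axes to a size-h hidden vector.
h = 5
W = rng.normal(size=(4, d, h))

def tcl(tokens):
    return np.einsum('nd,ndh->h', tokens, W)

row_tokens = tokenize(np.array([0.1, -1.2, 0.7]), x_cat=2)
hidden = tcl(row_tokens)
print(row_tokens.shape, hidden.shape)  # (4, 8) (5,)
```

Unlike a flat dense layer on concatenated raw values, the contraction keeps a separate weight slice per feature token, which is what lets the layer exploit the embedded per-feature representation.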