Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation

Synthetic data generation is a state-of-the-art approach for cases where access to real data is limited or privacy regulations restrict the use of sensitive data. A fair amount of research has addressed synthetic data generation for single-tabular datasets, but far less has addressed multi-tabular datasets with complex table relationships. In this paper we propose the HCTGAN algorithm to synthesize multi-tabular data from complex multi-tabular datasets. We compare our results to the probabilistic model HMA1. Our findings show that the proposed algorithm samples large amounts of synthetic data more efficiently for deep and complex multi-tabular datasets, while achieving adequate data quality and always guaranteeing referential integrity. We conclude that HCTGAN is suitable for efficiently generating large amounts of synthetic data for deep multi-tabular datasets with complex relationships. We additionally suggest that HMA1 be used on smaller datasets when the emphasis is on data quality.
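To make the referential-integrity claim concrete, the following is a minimal sketch of how one might check it on a synthetic parent/child table pair. It is not part of HCTGAN or HMA1: the helper `orphaned_foreign_keys` and the `users`/`orders` tables are hypothetical names introduced only for illustration, and the check assumes a single-column foreign key.

```python
def orphaned_foreign_keys(parent_rows, child_rows, parent_key, foreign_key):
    """Return the child rows whose foreign key matches no parent primary key.

    Hypothetical helper for illustration; an empty result means the
    parent/child pair satisfies referential integrity.
    """
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical synthetic tables: every order should point at an existing user.
users = [{"user_id": 1}, {"user_id": 2}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 2},
    {"order_id": 12, "user_id": 2},
]

# No orphaned rows: referential integrity holds for this pair of tables.
violations = orphaned_foreign_keys(users, orders, "user_id", "user_id")
```

A generator that guarantees referential integrity would yield an empty `violations` list for every parent/child relationship in the schema; a child row referencing a nonexistent parent key would show up in the returned list.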

