Synthetic tabular data is becoming a necessity as concerns about data privacy intensify. Tabular data is useful for testing systems, simulating real data, analyzing the data itself, and building predictive models, but real data is often unavailable because of confidentiality constraints. Previous techniques, such as TVAE (Xu et al., 2019) and OCTGAN (Kim et al., 2021), either cannot handle particularly complex datasets or are themselves complex, which results in long run times. This paper introduces PSVAE, a simple new model that produces high-quality synthetic data in less run time. PSVAE incorporates two key ideas: loss optimization and post-selection. In addition, the proposed model compensates for underrepresented categories and uses a modern activation function, Mish (Misra, 2019).
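For readers unfamiliar with Mish, the activation is commonly defined as mish(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The following is a minimal scalar sketch of that standard definition, not PSVAE's own implementation, which is not shown here; the function names `softplus` and `mish` are illustrative choices.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Evaluate the activation at a few sample points.
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({value:+.1f}) = {mish(value):.6f}")
```

Mish is smooth and non-monotonic: it is close to the identity for large positive inputs and approaches zero from below for large negative inputs. In a deep learning framework one would normally use the built-in version rather than a hand-written one.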