Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities

In an era of rapidly advancing data-driven applications, there is a growing demand for data in both research and practice. Synthetic data have emerged as an alternative when real data are unavailable (e.g., due to privacy regulations). Synthesizing tabular data presents unique and complex challenges, especially handling (i) missing values, (ii) dataset imbalance, (iii) diverse column types, and (iv) complex data distributions, as well as preserving (i) column correlations, (ii) temporal dependencies, and (iii) integrity constraints (e.g., functional dependencies) present in the original dataset. While substantial progress has been made recently in generative models, there is no one-size-fits-all solution for tabular data today, and choosing the right tool for a given task is therefore far from trivial. In this paper, we survey the state of the art in Tabular Data Synthesis (TDS), examine the needs of users by defining a set of functional and non-functional requirements, and compile the challenges associated with meeting those needs. In addition, we evaluate the reported performance of 36 popular research TDS tools against these requirements and develop a decision guide to help users find suitable TDS tools for their applications. The resulting decision guide also identifies significant research gaps.


