In data-driven systems, data exploration is essential for making real-time decisions. However, big data is stored in massive databases from which it is slow and costly to retrieve. Approximate Query Processing (AQP) provides approximate answers to aggregate queries from a summary of the data (a synopsis) that closely replicates the behavior of the actual data; it is useful when an approximate answer, obtained in a fraction of the exact execution time, is acceptable. This study explores the novel use of Generative Adversarial Networks (GANs) to generate tabular data that can serve as synopses in AQP. We thoroughly investigate the unique challenges posed by synopsis construction, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships, and then introduce advances in tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored for assessing the reliability of GAN-generated synopses. Our findings demonstrate that advanced GAN variants exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
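To make the idea of answering aggregate queries from a synopsis concrete, the following is a minimal sketch rather than the method of this study. It assumes a uniform random sample stands in for the synopsis (the study instead proposes GAN-generated tabular data), and the table, the `price` column, and the helper names `build_sample_synopsis`, `approx_avg`, and `approx_count` are hypothetical.

```python
import random
import statistics


def build_sample_synopsis(rows, sample_size, seed=0):
    """Return a uniform random sample of `rows` to act as the synopsis.

    Assumption: a plain sample replaces the GAN-generated synopsis.
    """
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)


def approx_avg(synopsis, column):
    """Estimate AVG(column) using only the synopsis rows."""
    return statistics.fmean(row[column] for row in synopsis)


def approx_count(synopsis, predicate, total_rows):
    """Estimate COUNT(*) WHERE predicate by scaling the synopsis count."""
    matches = sum(1 for row in synopsis if predicate(row))
    return matches * total_rows / len(synopsis)


# Hypothetical table: 100,000 rows with a bounded continuous column.
data_rng = random.Random(42)
table = [{"price": data_rng.uniform(0, 100)} for _ in range(100_000)]

# The synopsis holds 1% of the rows.
synopsis = build_sample_synopsis(table, sample_size=1_000)

exact_avg = statistics.fmean(row["price"] for row in table)
estimated_avg = approx_avg(synopsis, "price")
relative_error = abs(estimated_avg - exact_avg) / exact_avg

exact_count = sum(1 for row in table if row["price"] < 50)
estimated_count = approx_count(synopsis, lambda r: r["price"] < 50, len(table))

print(f"AVG   exact={exact_avg:.2f}  approx={estimated_avg:.2f}")
print(f"COUNT exact={exact_count}  approx={estimated_count:.0f}")
```

The queries touch 1,000 rows instead of 100,000, which is where the time saving comes from; the relative error between the exact and approximate answers is one simple way to judge how well a synopsis replicates the original data.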