Many problems in database systems, such as cardinality estimation, database testing, and optimizer tuning, require a large query load as data. However, it is often difficult to obtain a large number of real queries from users because of privacy restrictions or infrequent database access. Query generation is one approach to solving this problem. Existing query generation methods, such as random generation and template-based generation, do not consider the relationship between the generated queries and the existing queries, and may even generate semantically incorrect queries. In this paper, we propose a query generation framework based on generative adversarial networks (GANs) that generates a query load similar to a given query load. In our framework, we use a syntax parser to transform each query into a parse tree and traverse the tree to obtain the sequence of production rules corresponding to the query. The generator of the GAN takes a sample from a fixed prior distribution as input and outputs a query sequence; the discriminator takes real queries and fake queries produced by the generator as input and provides a gradient that guides the generator's learning. In addition, we incorporate a context-free grammar and semantic rules into the generation process, which ensures that the generated queries are syntactically and semantically correct. We conduct experiments on a real-world dataset, which show that our approach can generate new query loads with a distribution similar to that of a given query load, and that the generated queries are syntactically correct with no semantic errors. The generated query loads are also used in a downstream task, and the results show a significant improvement in models trained on query loads expanded with our approach.
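To make the production-rule representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy grammar (`GRAMMAR`) for a tiny SQL subset, represents a query as the sequence of rule indices in its leftmost derivation, and samples new sequences while masking out every rule that does not expand the current nonterminal. The `score` callback is a placeholder for where a trained generator's output would go; the discriminator and the semantic rules mentioned in the abstract are not modeled here.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of alternative right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal token.
GRAMMAR = {
    "query": [("SELECT", "cols", "FROM", "table", "where")],
    "cols": [("col",), ("col", ",", "cols")],
    "col": [("id",), ("age",)],
    "table": [("users",)],
    "where": [(), ("WHERE", "col", "op", "num")],
    "op": [("<",), ("=",)],
    "num": [("18",), ("65",)],
}

# Flat, indexed list of production rules; a query is a sequence of these indices.
RULES = [(lhs, rhs) for lhs, alts in GRAMMAR.items() for rhs in alts]


def decode(rule_ids):
    """Replay a leftmost derivation (list of rule indices) into a token list."""
    stack, tokens, ids = ["query"], [], iter(rule_ids)
    while stack:
        sym = stack.pop()
        if sym in GRAMMAR:
            lhs, rhs = RULES[next(ids)]
            if lhs != sym:
                raise ValueError(f"rule for {lhs!r} cannot expand {sym!r}")
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        else:
            tokens.append(sym)
    return tokens


def generate(score, rng, max_rules=50):
    """Sample a rule sequence, allowing only rules that expand the current symbol.

    `score(rule_index)` returns a positive weight; it stands in for the
    output of a learned generator (an assumption of this sketch).
    """
    stack, rule_ids = ["query"], []
    while stack:
        sym = stack.pop()
        if sym not in GRAMMAR:
            continue  # terminals need no decision
        valid = [i for i, (lhs, _) in enumerate(RULES) if lhs == sym]
        if len(rule_ids) >= max_rules:
            # Force termination by taking the shortest expansion.
            choice = min(valid, key=lambda i: len(RULES[i][1]))
        else:
            choice = rng.choices(valid, weights=[score(i) for i in valid])[0]
        rule_ids.append(choice)
        stack.extend(reversed(RULES[choice][1]))
    return rule_ids


if __name__ == "__main__":
    # Rule sequence for one hand-derived query.
    print(" ".join(decode([0, 1, 3, 5, 7, 4, 8, 10])))
    # Uniform weights in place of a trained generator.
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(decode(generate(lambda i: 1.0, rng))))
```

Because invalid rules are never offered to the sampler, every sequence it emits decodes to a string in the grammar's language, which is the property the abstract attributes to adding the context-free grammar to the generation process.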