Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination fabricating non-existent columns and topological collapse systematically defaulting to simplistic join patterns that fail to stress-test query optimisers. We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $Θ$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R^2 \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.
翻译:数据库研究与学习型查询优化器的开发高度依赖于真实的SQL工作负载。然而,由于严格的隐私法规,获取真实世界查询日益困难,而公开发布的匿名化轨迹通常为了保密性而剥离可执行的查询文本。现有的合成工具未能弥合这一训练数据缺口:传统基准测试提供的固定模板过少,难以进行统计泛化;而大语言模型方法则面临模式幻觉(虚构不存在的列)与拓扑崩溃(系统性地默认采用简化连接模式,无法对查询优化器进行压力测试)的问题。我们提出SynQL,一种确定性工作负载合成框架,用于生成结构多样、可立即执行的SQL工作负载。作为弥合训练数据缺口的基础步骤,SynQL针对核心SQL片段——包含投影、聚合与范围谓词的多表连接——这一分析型工作负载中主导的部分。SynQL放弃概率式文本生成,转而遍历实时数据库的外键图以填充抽象语法树,从而在构造上保证模式与语法的有效性。配置向量Θ提供了对连接拓扑(星型、链型、叉型)、分析强度与谓词选择性的显式参数化控制。在TPC-H与IMDb上的实验表明,SynQL能生成接近最大多样性的工作负载(拓扑熵H = 1.53比特),且在合成语料库上训练的基于树的成本模型在保留的合成测试集上达到R² ≥ 0.79,推理延迟低于毫秒级。这证实了SynQL在生产日志不可获取时,作为生成训练数据的有效基础。