Database research and development rely heavily on realistic user workloads for benchmarking, instance optimization, migration testing, and database tuning. However, acquiring real-world SQL queries is notoriously challenging due to strict privacy regulations. While cloud database vendors have begun releasing anonymized performance traces to the research community, these traces typically provide only high-level execution statistics without the original query text or data, which is insufficient for scenarios that require actual execution. Existing tools fail to capture fine-grained performance patterns or generate runnable workloads that reproduce these public traces with both high fidelity and efficiency. To bridge this gap, we propose ResQ, a fine-grained workload synthesis system designed to generate executable SQL workloads that faithfully match the per-query execution targets and operator distributions of production traces. ResQ constructs execution-aware query graphs, instantiates them into SQL via Bayesian Optimization-driven predicate search, and explicitly models workload repetition through reuse at both exact-query and parameterized-template levels. To ensure practical scalability, ResQ combines search-space bounding with lightweight local cost models to accelerate optimization. Experiments on public cloud traces (Snowset, Redset) and a newly released industrial trace (Bendset) demonstrate that ResQ significantly outperforms state-of-the-art baselines, achieving 96.71% token savings and an 86.97% reduction in runtime, while lowering maximum Q-error by 14.8x on CPU time and 997.7x on scanned bytes, and closely matching operator composition.
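The abstract reports fidelity as maximum Q-error but does not define the metric. As a minimal sketch, assuming the definition that is standard in the cardinality-estimation literature (the larger of the two ratios between a target value and the achieved value), the metric could be computed as follows. The function names `q_error` and `max_q_error` and the `eps` clamp are illustrative assumptions, not part of ResQ.

```python
from typing import Iterable


def q_error(target: float, achieved: float, eps: float = 1e-9) -> float:
    """Q-error between a trace target and the value a synthesized query achieves.

    Assumed definition: max(target / achieved, achieved / target), so the
    result is always >= 1 and equals 1 only for a perfect match.
    The eps clamp (an assumption) avoids division by zero.
    """
    t = max(target, eps)
    a = max(achieved, eps)
    return max(t / a, a / t)


def max_q_error(targets: Iterable[float], achieved: Iterable[float]) -> float:
    """Worst-case Q-error over a workload (e.g. per-query CPU time or scanned bytes)."""
    return max(q_error(t, a) for t, a in zip(targets, achieved))


if __name__ == "__main__":
    # Hypothetical per-query scanned-bytes targets vs. values from synthesized SQL.
    trace_targets = [100.0, 200.0, 8.0]
    synthesized = [50.0, 200.0, 32.0]
    print(max_q_error(trace_targets, synthesized))
```

Because the metric is a ratio, an overshoot and an undershoot by the same factor are penalized equally, which is why a reduction in maximum Q-error is reported as a multiplicative factor rather than as an absolute difference.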